[01:26:40] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul)
```
pc2007
root@pc2007:~# fdisk -l
Disk /dev/sda: 4.4 TiB, 4799217008640 bytes, 9373470720 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector si...
```
[01:27:27] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul)
[01:29:22] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) a:05Papaul>03Banyek @Banyek all yours
[07:52:29] hey, have a look, if you have time, at the debian package+puppet error on the new parsercaches (but it is ok if you don't have time, no high priority)
[07:54:06] we may need to ping moritz as it could be a debian installer issue that affects other installs- but try to understand what happened there first (it could be a simpler mistake)
[08:05:44] ok
[10:18:18] 10DBA, 10Operations: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) p:05Triage>03High
[10:21:02] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10jcrespo) 05Open>03declined We should work on T208383 instead.
[10:21:04] jynus: hey, can you take a look at https://phabricator.wikimedia.org/T203709 ? Especially s8 on eqiad. It's blocking my work and Manuel is not around for the next couple of weeks
[10:21:14] if the codfw part can be done too, it would be amazing
[10:29:23] 10DBA, 10Operations: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Banyek) a:03Banyek
[10:34:22] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Banyek)
[10:34:24] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) a:05Banyek>03None
[10:36:06] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) a:03Banyek
[11:00:23] Amir1: looking
[11:43:05] jynus: thanks
[11:50:56] Amir1: I am sorry, but we have a blocker on that, which is enabling GTID; that will take me some time to start on
[11:51:36] then this and next week we have limited availability, not only because manuel is out; there are holidays and other reasons
[11:52:02] which means I will be able to work properly on that exactly on the week manuel comes back
[11:52:30] by the time I am synced with understanding what I have to do- those processes are highly manual
[11:52:54] okay :(
[11:53:02] I wish I could help somehow
[11:53:05] so we had several outages
[11:53:11] and we finally solved those
[11:53:31] but sadly now we have to catch up with everything that wasn't done while I was taking care of those
[11:53:55] and I hope you understand that "new features" come next in priority after "fix broken things"
[11:54:38] that doesn't mean they are forgotten, but I have to be realistic about time for that
[11:59:58] jynus: yeah, to be clear, it's not a new feature. It's fixing a very old issue :D
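(On the GTID blocker jynus mentions above: a minimal sketch of MariaDB's documented way to switch a replica to GTID-based replication. The host name is a hypothetical placeholder, and the actual procedure used for production hosts may well differ.)
```
# Sketch only: move a MariaDB replica from binlog coordinates to GTID.
# db1xxx.eqiad.wmnet is a placeholder, not a real host.
mysql -h db1xxx.eqiad.wmnet -e "
  STOP SLAVE;
  CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
  START SLAVE;
"
# Verify: Using_Gtid should now report Slave_Pos
mysql -h db1xxx.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | grep -i gtid
```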
[12:00:38] I know
[12:00:58] but a "new feature" for DBAs == the server is not on fire
[12:01:03] :-/
[12:02:05] haha, true
[12:31:42] yay, no fires!
[12:32:00] i mean, fires are fun and warm and all, but not on databases :P
[12:34:46] addshore: I think I didn't formally recognize the help you gave us during the wikidata issue
[12:35:05] let me do it now, as I was at the time hurrying to fix it
[12:36:15] [= thanks, and thanks for knowing how on earth to fix it in the first place :)
[12:36:24] I felt like everything was going on all at once that day
[12:36:35] well, I made some mistakes, like not updating the page timestamps
[12:36:57] but I now know better than anyone what happens on wikidata edits :-)
[12:37:28] also there may not be a next time, as we are setting up integrity checks right now
[12:37:32] hehe, indeed, well, I'm not sure how many people would have thought of that anyway :D It took me some hours to even think about it
[12:37:43] * addshore is looking forward to integrity checks
[12:37:57] I actually thought about that-
[12:38:16] but then I said- bah, of course it will look at the latest one, not based on the metadata
[12:38:32] but of course it needs that, because on some wikis the latest is not the default
[12:39:46] we need more team cross-over to get better at what we are doing
[13:05:09] how were you thinking of latest, jynus?
[13:05:21] higher rev_id?
[13:05:52] I don't know, I was just wrong :-D
[13:06:22] unlike traditional mediawiki, I don't really know how wikidata stores things
[13:06:31] but that looked like let's order by rev_id
[13:06:52] which then breaks in the odd case that some revisions were imported... :)
[13:09:46] yeah, but it breaks other things too
[13:09:54] users, diffs
[13:11:11] maybe we should double the storage and store diffs in addition to revisions- aside from the obvious usages, we get things like blame becoming possible
[13:11:43] I think diffs *are* stored
[13:11:47] ie. cached
[13:11:55] as they can be regenerated
[13:12:04] cached != stored
[13:12:13] our model works with revisions
[13:12:20] yes
[13:12:22] although sometimes they are compressed
[13:12:29] and actually stored as diffs
[13:12:31] or even in latin1 :)
[13:13:31] tim suggested maybe storing blame trees too in the future, to make blame possible
[13:14:09] I think the problem is having a proper blame algorithm
[13:14:14] I don't know, I was just throwing out some ideas
[13:14:28] some time ago, I tried just using git blame on articles
[13:14:33] but it breaks quite badly
[13:14:39] since articles aren't line-based
[13:14:43] as code often is
[13:15:12] once we have a proper blamer
[13:15:25] that is the least of the issues
[13:15:26] it could run very slowly, or get different caching layers
[13:15:30] now make it efficient
[13:15:51] there are pages with 20000 revisions, or maybe 100K
[13:15:55] what good is a blame tool which doesn't work?
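(Platonides' git-blame experiment is easy to reproduce. A rough sketch using the public MediaWiki revisions API; the article title is taken from the example later in this log, the revision limit is illustrative, and a real history with 20000+ revisions would need API continuation and be very slow, which is exactly the efficiency problem being discussed.)
```
#!/bin/bash
# Sketch: replay part of an article's history into git, one revision per
# commit, then run git blame on it. Assumes git, curl and jq are installed.
TITLE="Escala_de_Kinsey"
API="https://es.wikipedia.org/w/api.php"
mkdir -p blame-test && cd blame-test && git init -q

# oldest 25 revisions, ids + wikitext content
curl -s "$API?action=query&prop=revisions&titles=$TITLE&rvdir=newer&rvlimit=25&rvprop=ids%7Ccontent&rvslots=main&format=json&formatversion=2" > revs.json

n=$(jq '.query.pages[0].revisions | length' revs.json)
for i in $(seq 0 $((n - 1))); do
    jq -r ".query.pages[0].revisions[$i].slots.main.content" revs.json > article.wiki
    revid=$(jq -r ".query.pages[0].revisions[$i].revid" revs.json)
    git add article.wiki
    git commit -q -m "revid $revid"
done

# This is where it breaks down: wikitext is not line-based, so a cosmetic
# edit that rewraps a paragraph "steals" the blame for whole lines.
git blame article.wiki | head
```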
[13:16:16] and they each have to be decompressed and retrieved from disk
[13:16:19] if I really care about who wrote XX, I may be ok with waiting an hour
[13:16:39] if you are ok with waiting 1 hour, that can be cached and stored :-)
[13:17:18] you just need to double the content storage, around 20 TB
[13:18:18] (not taking into account redundancy, the extra caching chain, extra jobqueue resources)
[13:18:25] I think it is doable
[13:18:40] just not easy or cheap
[13:19:02] if it can be regenerated, maybe we don't need to store it so much
[13:19:20] just a big-enough persistent cache could do
[13:19:29] oh, it needs that, even if you want it dynamic like the parsercaches
[13:19:52] so only the most accessed articles get it updated
[13:20:43] not sure what you mean
[13:22:46] we can store just a subset of the articles to "save" space, the same way only a subset of the revisions are parsed into html and stored so our infra doesn't melt down from rendering wikitext
[13:23:14] (saving blame maps)
[13:23:31] I thought that most articles had their html in the parsercache
[13:26:11] well, the good thing about "caches" is that the most used ones get automatically preferred
[13:28:16] Platonides: I see you with enough motivation to take on https://phabricator.wikimedia.org/T2639 :-)
[13:28:29] * Platonides reluctantly opens that link
[13:30:01] a 2004 task... :)
[13:31:02] not so long ago I removed some vandal text from an article: "* It was established, for example, that 60% of men and 33% of women had taken part in at least one overt homosexual act by the age of 16."
[13:31:06] https://es.wikipedia.org/w/index.php?title=Escala_de_Kinsey&diff=prev&oldid=111440482
[13:31:37] I searched through the history to 'blame' it
[13:31:53] it turned out to have been added 10 years ago by an IP :(
[13:31:55] https://es.wikipedia.org/w/index.php?title=Escala_de_Kinsey&diff=19048088
[14:31:25] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Banyek) a:05Banyek>03Papaul @Papaul as I checked the storage on the hosts, it is set up with a stripe size of 512Kb instead of 256K (https://wikitech.wi...
[14:45:05] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10jcrespo) A larger stripe size should not be a huge issue (unlike a smaller one, which affected performance significantly and we didn't like it). We were thi...
[14:47:21] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Banyek) @jcrespo actually I can change the stripe size on one of the hosts and do some comparison, what do you think about this?
[14:48:52] ^ banyek- it requires a lengthy reconfiguration and a reimage, I am a bit worried about that
[14:49:15] we are already a bit late with this, so I am unsure
[14:49:28] certainly it is better to do it now than later
[14:50:14] if you think it won't take you long I am ok with it, but we cannot spend a lot of time on that
[14:51:15] e.g. if you can work on puppet in parallel with the reimage, go on
[14:51:55] ok, but without that, how do I know what the metrics for "performance of the disk is acceptable" are? note: in the past few years I worked with fusionIO drives, except where I had raid0 SSD :(
[14:52:15] hm...
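(A sketch of the comparison banyek proposes, assuming the controllers are managed with MegaCli and that fio is available on the hosts; the directory and job parameters are illustrative only.)
```
# 1. Confirm the configured strip size on each host
#    (MegaCli prints a "Strip Size:" line per logical drive):
megacli -LDInfo -LAll -aAll | grep -i 'strip size'

# 2. Run the same random-write job on the 256K host and the 512K host and
#    compare IOPS/latency; 16k blocks roughly match InnoDB page writes:
mkdir -p /srv/fio-test
fio --name=pc-stripe-test --directory=/srv/fio-test \
    --rw=randwrite --bs=16k --direct=1 --ioengine=libaio \
    --iodepth=16 --numjobs=4 --size=4G \
    --runtime=120 --time_based --group_reporting
```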
[14:52:24] I can compare it with other hosts
[14:52:27] ok, nm
[14:52:29] I'll check
[14:52:55] I really don't have an answer for that, I am just worried about spending too much time on it, you understand?
[14:53:40] with the disk sizes we are getting, it makes sense to increase the stripe size anyway, so up to you :-)
[14:54:46] I trust you to do the right thing no matter what you decide :-)
[15:05:42] actually I would leave it as-is, but you brought this up
[15:05:46] here's my proposal
[15:06:50] no proposal on my side, do whatever you consider adequate as long as you don't spend much time on it :-D
[15:06:54] I do/check/fix_if_needed the puppet part, and if we still have time when I am done with it, I can go for setting the stripe size - once the puppet part is complete there is no need to worry about reimaging, as it is pretty quick
[15:07:09] cool with me
[15:07:45] the only thing you may not know is that we were already thinking of increasing the stripe size
[15:07:55] but never had the time to test it
[15:08:04] and the RAID tool is horrible
[15:09:03] banyek: do you remember which is the x1 host you set up on codfw?
[15:09:07] do you remember the name?
[15:09:16] db2096
[15:09:24] thanks, you saved me some searching
[15:09:30] I am glad!
[15:11:30] banyek: I forgot one last thing to ask you
[15:11:36] about pc* hosts
[15:11:43] go for it
[15:11:44] disable the learning cycle if they have it
[15:11:49] on the RAID controller
[15:12:02] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson)
[15:12:03] so we prevent them from going write-through every 90 days
[15:12:31] we do those manually instead of arbitrarily killing our performance :-)
[15:12:43] ! I wrote it down!
[15:13:14] I will check that on db2096, as I am checking it for unrelated reasons
[15:13:53] good
[15:14:01] I did not touch that part
[15:15:30] actually, I am looking at the documentation
[15:15:34] and it may not be needed
[15:15:50] there is a new mode, "Auto-Learn Mode: Transparent"
[15:15:58] which apparently doesn't impact performance?
[15:16:05] so it may be no longer needed?
[15:16:21] "In PERC H700 and previous, virtual disks automatically switch to Write-Through mode when the battery charge is low because of a learn cycle. Once the battery charge is sufficient, Write-Back mode will be re-enabled."
[15:19:21] I am asking around whether anyone knows about this and whether I am understanding it well
[15:29:26] banyek: ok, so ignore my last comment for now, it may not be needed based on the above, sorry for the ping
[15:30:00] ok
[15:30:20] what do you think about this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470851/
[15:30:30] it failed because of stretch, right?
[15:30:31] looking
[15:31:27] one thing: you keep editing line 1954, not sure if intentionally
[15:31:46] it is good after your change, just in case you didn't notice it
[15:32:26] but I would put it in a separate commit if you really want to fix it, to not mix functionality
[15:33:06] banyek: the reason it fails is "wmf-style: total violations delta 1"
[15:33:13] because site.pp should not have parameters
[15:33:34] yes, but as you can see I just copied the pc2004 block
[15:33:43] so that would fail too?
[15:33:49] so the existing ones are wrong- maybe consider refactoring the existing ones first into a hiera parameter?
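(Back on the RAID battery question from earlier: a sketch of how the learn-cycle behaviour banyek wrote down could be inspected, assuming a MegaCli-managed PERC controller; exact flag spellings vary between MegaCli versions, so treat these as illustrative.)
```
# Which auto-learn mode is configured (on newer firmware this can be the
# "Transparent" mode quoted above, which does not drop the write cache):
megacli -AdpBbuCmd -GetBbuProperties -a0

# Battery charge level and whether a learn cycle is currently in progress:
megacli -AdpBbuCmd -GetBbuStatus -a0

# Current cache policy of the logical drives; during a learn cycle on older
# controllers this falls back from WriteBack to WriteThrough:
megacli -LDInfo -LAll -aAll | grep -i 'cache policy'
```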
[15:34:04] as I said, it needs a lot of puppet work :-)
[15:34:42] aside from that, there are more changes that have to happen at the same time - I would add the 3 of them at the same time
[15:34:45] yes, that was the point I wanted to get to- to confirm that it is possible the existing ones violate a rule
[15:34:48] thanks!
[15:34:52] and there is some extra thing
[15:35:13] that has to happen at the same time- prometheus monitoring
[15:35:32] grep pc1004 to see all references to the parsercache hosts on puppet
[15:35:37] to understand what I mean
[15:35:51] I can help with the refactoring- it is not an easy one, ok?
[15:36:06] I may be able to do that on friday- not sure
[15:36:51] if you find time for that I'd appreciate it (but then drop me a message); if not, I'll work on it
[15:37:18] sure, I normally add you as reviewer on every patch
[15:37:30] that will CC you on every change
[15:37:44] great
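(What the "grep pc1004" suggestion above looks like in practice, assuming a local checkout of operations/puppet cloned from gerrit:)
```
# List every file that still references an old parsercache host, so the
# site.pp, hiera and prometheus pieces all get updated together:
git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
grep -rn 'pc1004' --include='*.pp' --include='*.yaml' --include='*.erb' .
```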