14:30:16 #startmeeting Pulp Triage 2017-08-18
14:30:16 !start
14:30:16 #info ttereshc has joined triage
14:30:16 Meeting started Fri Aug 18 14:30:16 2017 UTC and is due to finish in 60 minutes. The chair is ttereshc. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:30:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:30:16 The meeting name has been set to 'pulp_triage_2017_08_18'
14:30:16 ttereshc: ttereshc has joined triage
14:30:21 !here
14:30:21 #info daviddavis has joined triage
14:30:21 daviddavis: daviddavis has joined triage
14:30:24 elijah_d ok we can chat after triage.
14:30:26 #info mhrivnak has joined triage
14:30:26 !here
14:30:26 mhrivnak: mhrivnak has joined triage
14:30:36 !here
14:30:36 #info bizhang has joined triage
14:30:37 bizhang: bizhang has joined triage
14:30:38 !next
14:30:39 !here
14:30:40 #topic Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:40 #info bmbouter has joined triage
14:30:40 ttereshc: 3 issues left to triage: 2966, 2979, 2985
14:30:41 Issue #2966 [NEW] (unassigned) - Priority: Normal | Severity: High
14:30:42 Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:43 bmbouter: bmbouter has joined triage
14:30:44 mhrivnak, ok
14:30:49 !here
14:30:49 #info dkliban has joined triage
14:30:49 dkliban: dkliban has joined triage
14:31:00 !here
14:31:00 #info dalley has joined triage
14:31:00 dalley: dalley has joined triage
14:31:08 mhrivnak, were you talking about this issue with elijah_d ?
14:31:30 yeah, I see, so skip it for now?
14:31:38 Yes. I just want to try digging in a bit more to see if we can figure out where sigkill is coming from.
14:31:48 !propose skip
14:31:48 #idea Proposed for #2966: Skip this issue for this triage session.
14:31:48 ttereshc: Proposed for #2966: Skip this issue for this triage session.
14:31:55 try using strace on the pid receiving the sigkill
14:31:58 Yeah, I think that's fine. We'll look at it today.
14:32:11 !accept
14:32:11 #agreed Skip this issue for this triage session.
14:32:11 ttereshc: Current proposal accepted: Skip this issue for this triage session.
14:32:12 ttereshc: 2 issues left to triage: 2979, 2985
14:32:12 I don't think strace will work, because sigkill doesn't go to the process.
14:32:13 #topic Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:13 Issue #2979 [NEW] (unassigned) - Priority: Normal | Severity: High
14:32:14 Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:19 !here
14:32:19 #info ipanova has joined triage
14:32:19 ipanova: ipanova has joined triage
14:32:21 But there are some other kernel audit tricks that might help. :)
14:32:30 oh yeah sigkill goes to init
14:32:38 so I commented on this one
14:32:58 I did not comment that both katello and sat will be affected by this b/c they both use that MAX_TASKS... option
14:33:39 mhrivnak: i was trying to track down the sigkill with audit and so far no luck
14:33:40 fixing this will be a significant change to how our celery stuff runs, I think the only way is to have celery stop forking
14:34:00 ipanova ok, let's compare notes in a bit.
14:34:20 and we need one more piece of info, which is: is the postgresql client driver single threaded or not?
14:34:30 I'm not sure, but if it is then not only will pulp2 have this problem but also pulp3
14:34:49 349766
14:34:53 oops
14:34:55 Is there a recommendation we can make for being able to identify when one of them is deadlocked?
14:35:09 there is not unfortunately
14:35:19 perhaps they could count the threads
14:35:49 but that could also be unreliable since the thread count changes during operation
14:36:10 really you need a core dump to look at specifically how many pymongo threads are in existence
14:36:29 But if it makes it far enough past the fork to be spawning extra threads, has it effectively dodged the bullet?
14:37:09 Anyway, maybe that's getting into the weeds.
14:37:13 elijah_d++
14:37:14 ichimonji10: elijah_d's karma is now 10
14:37:22 it has, but the problem is that before it makes it that far, you can't know if it will make it that far or if it's deadlocked already
14:37:23 ichimonji10, thanks!
14:37:29 If there's a way to help people identify this in the meantime, that could mitigate the severity.
14:37:38 there isn't a reliable way
14:38:52 So options seem to be 1) make celery stop forking, 2) figure out how to delay database access during startup until post-fork, 3) wait for pulp 3?
14:39:18 not suggesting any of those are good or easy of course. :)
14:39:26 yeah but they are clear options
14:39:39 so I don't want to take this on the sprint, but I think we kind of need to
14:39:47 at least to understand if pulp3 will be affected or not
14:40:10 option (2) can't be done without rearchitecting our crash-recover scenarios
14:40:18 I suggest we accept it but not take it on this sprint.
14:40:34 I think we can get to the other side of the plugin api work and then look at this.
14:40:47 that is also what I want, but consider this
14:41:02 katello and sat are both enabling the MAX_TASKS_.... options
14:41:04 option
14:41:19 so they will experience rare deadlocking by doing that
14:41:41 that is the only thing that makes me think we should do more (even though I want to focus on the plugin API for pulp3)
14:41:42 ugg
14:42:36 Gotcha. If you want to pursue this on this sprint, that's fine with me.
14:42:55 so it looks like it was enabled in 6.2.7
14:43:00 It seems like any solution is likely to have a long timeline, but it doesn't hurt to start quickly.
14:43:14 it's been a while and no reports so far
14:43:23 whatever you all decide is fine w/ me
14:43:28 (I mean maybe we still can wait till the next sprint)
14:43:31 I wanted to just provide the scope of impact, etc
14:43:47 also I can advise about how to fix (stop forking) but I don't plan to take as assigned
14:43:50 either way
14:44:04 !propose accept and add to sprint
14:44:04 ttereshc: propose accept Propose accepting the current issue in its current state.
14:44:13 !propose other accept and add to sprint
14:44:13 #idea Proposed for #2979: accept and add to sprint
14:44:13 ttereshc: Proposed for #2979: accept and add to sprint
14:44:34 +1
14:45:03 !accept
14:45:03 #agreed accept and add to sprint
14:45:03 ttereshc: Current proposal accepted: accept and add to sprint
14:45:05 #topic I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:45:05 ttereshc: 1 issues left to triage: 2985
14:45:06 Issue #2985 [NEW] (unassigned) - Priority: Normal | Severity: Medium
14:45:07 I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:46:44 This one definitely needs fixing.
14:47:00 let's add it to the sprint?
14:47:05 I guess this issue is valid for any nested endpoints
14:47:09 yeah
14:47:10 That works for me.
14:47:15 not only importers
14:47:16 +1
14:47:18 +1
14:47:22 +1
14:47:42 accept, add to sprint, and comment about checking other nested urls
14:47:56 !propose other accept, add to sprint, and comment about checking other nested url
14:47:56 #idea Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:47:57 ttereshc: Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:48:03 !accept
14:48:03 #agreed accept, add to sprint, and comment about checking other nested url
14:48:03 ttereshc: Current proposal accepted: accept, add to sprint, and comment about checking other nested url
14:48:04 ttereshc: No issues to triage.
14:48:08 !end
14:48:08 #endmeeting
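
Note on #2979: the discussion floats counting a worker's threads as a rough way to spot a deadlocked child, while cautioning that the count changes during normal operation and so is not reliable. A minimal sketch of such a check on Linux might look like the following. The helper name count_threads and the choice to read /proc/<pid>/status are illustrative assumptions, not part of Pulp or Celery, and a low count is at most a hint that a worker deserves a closer look (for example with a core dump), never proof of a deadlock.

```python
import os


def count_threads(pid):
    """Return the number of OS threads in process `pid`.

    Hypothetical helper for illustration only. It reads the "Threads:"
    field of /proc/<pid>/status, so it works on Linux only.
    """
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError(f"no Threads: field found for pid {pid}")


if __name__ == "__main__":
    # Inspect the current process; in practice the pid of a worker
    # child process would be passed instead.
    print(count_threads(os.getpid()))
```

Sampling the count several times over a short interval, rather than once, would reduce false alarms from workers that are simply still starting up, though it does not remove the ambiguity raised in the discussion.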