#foswiki 2013-09-24,Tue


Who What When
***gac410 has left [04:01]
................................. (idle for 2h41mn)
ChanServ sets mode: +o CDot [06:42]
............ (idle for 56mn)
ChanServ sets mode: +o MichaelDaum [07:38]
.... (idle for 15mn)
ChanServ sets mode: +o SvenDowideit [07:53]
.... (idle for 16mn)
ChanServ sets mode: +o MichaelDaum [08:09]
.... (idle for 17mn)
ChanServ sets mode: +o MichaelDaum [08:26]
....................................... (idle for 3h10mn)
ck-123Hi. I have a problem with the LDAPContrib which has a race condition somewhere causing a corrupt database. That results in random access denied errors and 100% CPU processes which won't stop (using FCGI). [11:36]
jastrace condition? during what? [11:37]
ck-123I set maxcache to 0, but after recreation of the db it does not seem to be "complete". So after a while I can see that two processes are adding groups/users at the same time which results in a corrupt database. The db_verify tool is quite handy to investigate that. [11:38]
jastright... in our setups we tend to have the DB recreated during the night in a cronjob [11:39]
ck-123My current "fix" (or better a first workaround) is that I included in LdapContrib.pm a check using db_verify. In case of a corrupted db, it is replaced with a valid one
That works.... for now. But can I avoid adding users/groups in general?
[11:39]
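
Editor's note: the workaround ck-123 describes (verify the cache file, swap in a known-good copy if it is corrupt) can be sketched as below. This is a minimal illustration in Python, not the code that was added to LdapContrib.pm. It assumes the Berkeley DB `db_verify` utility is on the PATH and that it exits with status 0 for a healthy file; the function names and the backup path are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def db_is_valid(db_path, verify_cmd=("db_verify",)):
    """Return True if the verifier exits with status 0 for db_path."""
    result = subprocess.run(
        [*verify_cmd, str(db_path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restore_if_corrupt(db_path, backup_path, verify_cmd=("db_verify",)):
    """Replace db_path with backup_path when verification fails.

    Returns True if a restore happened, False if the file was fine.
    """
    db_path, backup_path = Path(db_path), Path(backup_path)
    if db_is_valid(db_path, verify_cmd):
        return False
    # Copy next to the target and then rename, so that readers never
    # see a half-written file.
    tmp = db_path.with_name(db_path.name + ".restore")
    shutil.copy2(backup_path, tmp)
    tmp.replace(db_path)
    return True
```

A check like this only repairs the damage after the fact; it does not stop two processes from writing at the same time.
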
jastyou mean turn off caching altogether? [11:40]
ck-123Specific to our setup, I guess, is that we have ~500 groups and many user accounts. On top of that we have many requests. Both raise the likelihood of the race condition [11:41]
jastwell, it depends
we have the DB recreated only every 24h
hence we never get race conditions
[11:42]
ck-123No, not turning off caching, but just use the database readonly. And write it only when "manually" triggered [11:42]
jastset the cache lifetime to a very high value
and write the DB in a cronjob. example:
[11:42]
ck-123We are doing that already...
a cron triggers the refresh once in the night and maxcacheage is set to 0
But in case of something missing in the database, there is an ldap call and an insertion of the user or group. And that breaks the database
[11:43]
jastoh, I see [11:45]
what confuses me about that we haven't seen this behaviour so far
+is
and some of our wiki instances are pretty big
(*and* active)
there is no explicit locking mechanism in LdapContrib, so I'm assuming that in our systems something else (BDB or the filesystem) is taking care of that and that isn't happening in your system for whatever versions
dang :)
*reasons
[11:51]
MichaelDaumck-123, hi
I see what you mean
this situation might happen when a user logs in but is yet unknown to the ldap cache
i.e. a lot of people log in, yet refreshldap has never been called
before
[11:55]
jastthe curious thing is that that happens frequently in our wikis but so far it has never screwed up the DB [11:57]
MichaelDaumin general a call to checkCacheForLoginName() should rarely be in a situation having to patch in a single user
jast, doesn't mean much. your q-wiki is quite different from most other foswikis
[11:58]
rf1What happens if not just the user is unknown to the cache, but at the same time the group the user belongs to?
Does that have an effect somewhere?
[11:59]
jastMichaelDaum: well, the LDAP part isn't that different :) [12:01]
ck-123Fortunately the automated replacement of a "valid" cache.db is keeping our instance stable, but I am looking for a "fix" [12:01]
jastrf1: group should get cached then, if I read the code correctly
if something looks at the group, that is
ck-123: perfectly understandable :)
[12:01]
MichaelDaumbasically two instances should never run into the situation building up a new cache.db
this only happens when they hit the site in the same microsecond and the test for cache.db_tmp fails for both
[12:02]
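
Editor's note: the window MichaelDaum describes exists when "is cache.db_tmp there?" and "create cache.db_tmp" are two separate steps. The log does not show how LdapContrib performs this test, so the following is only a generic Python sketch of closing such a window by letting the operating system do both steps atomically. The function name is hypothetical.

```python
import os


def try_begin_refresh(tmp_path):
    """Atomically claim the refresh by creating tmp_path.

    Returns True for exactly one caller; every other caller gets False
    until the file is removed again.
    """
    try:
        # O_EXCL makes "check that it is absent" and "create it" a
        # single atomic step, so two processes cannot both succeed.
        fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

The process that gets True would build the new cache and remove or rename the file when done; the others would report "already refreshing".
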
ck-123Yes, this is prevented (in our case) by setting the maxcacheage to 0. [12:03]
MichaelDaumif one instance detects it, an "already refreshing" message is displayed (in debug mode) [12:03]
ck-123But the nasty thing is the "update" of an existing cache. [12:04]
MichaelDaumsetting maxcache age to zero is the right thing to do on large wikis.
it prevents any foswiki being in the middle of rendering a page from having to precache the ldap cache at the same time
[12:04]
ck-123I can see in (extended) logfiles that two processes are adding the same user by triggering cacheUserFromEntry function [12:05]
MichaelDaumare you using TopicInteractionPlugin?
or is there any other rest call a single view could trigger?
[12:05]
ck-123No, I do not use that plugin [12:06]
MichaelDaumto create this bug, I'd craft a page that fires a couple of rest calls for the same user at the time of a view
ImagePlugin?
[12:07]
ck-123No, also not imageplugin, but I cannot answer your other question (about the rest call) [12:07]
MichaelDaumhave a look at your webserver's access log
and find out which two http calls happen and which client triggers them
could be something odd like a http redirect ... a mis-configured short-urls setup of some sort
[12:08]
ck-123uff... we reproduced the issue by calling Generic/WebHome via ajax calls from a js script [12:10]
MichaelDaumah ok so the situation _is_ artificial [12:13]
ck-123the reproduced one.
The one on the production system not
but behaviour is exactly the same
[12:13]
MichaelDaumdid you have issues on prod when zeroing maxcacheage? [12:14]
ck-123Yes.
That's what I meant. It happens without recreating the database
it happens on "updating" in the function cacheUserFromEntry
[12:14]
MichaelDaumthen please look up the access.log to see which two http calls for a new user occur in such a millisecond timeframe
a kind of double punch
which knocks it out
btw which kind of filesystem is the working/tmp directory of your foswiki on?
[12:15]
ck-123ext4
hmm...
I can see in the access.log that just before the db gets corrupted there are always two view requests in the same second (but to different topics and from different IPs)
all other calls in that second are like "GET /pub/System/PatternSkin/pattern.js HTTP/1.1" ...
[12:21]
MichaelDaumdo you use a viewfile protection of attachments? [12:27]
ck-123no, because that broke some plugin (IIRC the imagegalleryplugin.... or something like that) [12:28]
MichaelDaumthere are some comments in that respect at http://foswiki.org/Support.ApacheConfigGenerator [12:29]
ck-123What is the effect of using the preCache parameter. Will the database be "complete" in case it is switched on and recreated? [12:29]
MichaelDaumanyway: good that your get /pub's aren't hitting the ldap cache as well
that's important to nail down your issue
[12:29]
ck-123As I stated before... I guess in my situation two things come together. [12:30]
MichaelDaumso you said there are two view requests in the same second, coming from different IPs, hm. [12:31]
ck-1231. many user requests.
2. Blown LDAP (which could delay ldap processing in some subroutine), which could increase the "dangerous" timeframe for a race condition
[12:31]
MichaelDaumthe danger is only a fraction of a millisecond when two processes create a cache.db_tmp file at the very same time while it isn't there on the filesystem yet
are there multiple web servers reusing the same filesystem including the working dir of foswiki?
[12:33]
ck-123Sorry Michael, but I think it breaks the db somewhere else.
The cacheUserFromEntry sub is called after a process ties the db
[12:35]
MichaelDaumokay [12:37]
ck-123but two processes are calling the cacheUserFromEntry within the same timeframe (proc a does not see user XY and calls cacheUserFromEntry/ proc B does not see user XY and calls cacheUserFromEntry)
Then in this sub two times the values are written to the tied hash
which breaks the db somehow.
I cannot prove this right now, but after my debugging with some timestamp outputs I am nearly sure
[12:37]
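
Editor's note: the pattern ck-123 describes is a classic check-then-insert race: both processes see that user XY is missing, both query LDAP, and both write to the same file. One common way to serialise such writers is an advisory file lock held around the whole "look up, fetch, insert" sequence. The sketch below is a Python illustration of that idea only. It is not LdapContrib code, it uses Python's `dbm.dumb` as a stand-in for the DB_File cache, it relies on `fcntl` (Unix only), and all names are hypothetical.

```python
import dbm.dumb
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the with-block."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def cache_user(db_path, login, fetch_from_ldap):
    """Return the cached record for login, inserting it at most once."""
    with exclusive_lock(db_path + ".lock"):
        # Open the database only while holding the lock, so that two
        # processes never write to the file concurrently.
        with dbm.dumb.open(db_path, "c") as db:
            key = login.encode()
            if key not in db:
                db[key] = fetch_from_ldap(login).encode()
            return db[key].decode()
```

With this shape, the second process blocks until the first has finished, then finds the user already cached and skips both the LDAP call and the write.
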
MichaelDaumI see [12:39]
***ChanServ sets mode: +o Babar
ChanServ sets mode: +o Babar
[12:39]
MichaelDaumdo you see "$loginName is unknown, need to refresh part of the ldap cache" before the crash? [12:40]
ck-123If I would remove all writing to the tied hash (excluding the full recreation)... would that break something? [12:40]
MichaelDaumnot that I can think of. good hot fix. [12:40]
ck-123No, I see a "WARNING: oops, no result looking for user XY in LDAP"
Or in case it breaks I can see it twice
[12:41]
MichaelDaumoh
sounds like the login name can't be found repeatedly no matter how hard the thing tries
[12:41]
ck-123ah no.... sorry
it is looking for group
but with the username
so, this has nothing to do with the problem
[12:42]
MichaelDaumah that's in checkCacheForGroupName() [12:43]
ck-123so, to correct myself. It is that both processes are trying to find a missing user. It's not in the cache, so both ask LDAP and both then insert the result into the DB. [12:44]
MichaelDaumif you switch on {Ldap}{Debug}, then there should be a "group is unknown need to refresh part of the cache" message right in front of the "oops no result looking for group"
a missing user or missing group
[12:45]
ck-123Yes, I inserted additional debug output, so I got sessionID and processID on each sub call (enter and leave)
so that I can see proc1 is still inserting something when proc2 is going to do the same stuff
[12:46]
MichaelDaumgotcha [12:47]
ck-123I saw that on recreating the cache by calling the view script with refresh parameter, the database does not have "all" groups/users
Is the preCache parameter used for that?
[12:49]
MichaelDaumyes
you must switch it on to create a proper cache.db with all users and groups in it
did you switch off precache?
[12:50]
ck-123In case of disabling adding values to the database while running, I would like to provide a proper cache.db to reduce the needed lookups (that are then no longer cached) [12:51]
MichaelDaumso there we are: disabling {PreCache} puts you in a high danger of corrupting the DB_File
and DB_File is known to be quite unfit for concurrent write access
in case you require an incremental build of the user base - switching off PreCache - then a different database backend is needed ... not DB_File
[12:51]
ck-123ok
Great.
[12:53]
MichaelDaum:( [12:54]
ck-123no, i mean at least I now have a plan
how to proceed
[12:54]
MichaelDaumthe real fix is to have one central real user&groups database table
with connectors to ldap or whatever other sources there are to fetch user infos
[12:54]
ck-123I can reduce the possibility by recreating the cache.db regularly with precache on, but to be 100% sure, I still need to disable all writing except for the complete refresh [12:55]
jastwell *in theory* you're supposed to query LDAP servers whenever you need their info
but the way foswiki's user mappings are currently structured that would be enormously wasteful in some situations
[12:55]
ck-123yes, but *in theory* you never have performance problems [12:56]
MichaelDaumand *in theory* ldap servers never ever go down
neither facebook, twitter or google or whatever source for user records there are to put them into the cache
[12:56]
ck-123ok ok ok ok ..... is it worth it that I create a patch so that you can configure by a cfg parameter, if updating the DB is allowed? [12:58]
MichaelDaumck-123, definitely. very welcome. [12:58]
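
Editor's note: the patch idea discussed here (a configuration switch that decides whether a cache miss may write to the database) could take roughly the following shape. This is a Python sketch of the behaviour only, with an in-memory dict standing in for cache.db. The class, the `allow_update` flag and the method names are hypothetical and are not taken from LdapContrib or from any patch that was later filed.

```python
class UserCache:
    """Cache that can be switched to read-only between full refreshes."""

    def __init__(self, records, allow_update=True):
        self._records = dict(records)
        # Hypothetical switch: when False, cache misses are answered
        # from LDAP but never written back.
        self.allow_update = allow_update

    def lookup(self, login, fetch_from_ldap):
        """Return the record for login, consulting LDAP on a miss."""
        if login in self._records:
            return self._records[login]
        record = fetch_from_ldap(login)
        if self.allow_update and record is not None:
            self._records[login] = record
        return record

    def full_refresh(self, records):
        """Replace the whole cache; the only write path when read-only."""
        self._records = dict(records)
```

With `allow_update=False`, every miss costs an LDAP query, which is why a complete cache built with PreCache enabled matters in that mode.
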
ck-123ok.... thank you guys very much. I will just fix the issue (and replace my workaround).... and then I try to attach this somewhere on foswiki.org [12:59]
MichaelDaumhere's the page to file a bug to LdapContrib http://foswiki.org/Tasks/LdapContrib
you might even reuse http://foswiki.org/Tasks/Item11632
[13:00]
***ChanServ sets mode: +o CDot
ChanServ sets mode: +o gac410
[13:06]
jastif an LDAP server goes down people can't authenticate anymore, anyway
so at least if the server uses LDAP apache auth, you're screwed, cache or no cache
[13:09]
MichaelDaumtrue
yet still there's not much of a way around a user data cache inside foswiki
given user registration builds upon an oauth call, you definitely need a safe place where to store infos
[13:14]
jastsure
though technically I wouldn't want that in a cache but in a DB :)
[13:18]
MichaelDaumright [13:18]
jastto me, calling something a cache when deleting it loses information is a bit questionable [13:19]
MichaelDaumthat's why the current DB_File "cache" inside LdapContrib is questionable as well [13:19]
jastin fact I think we had a case in one wiki where someone used refresh=force, unaware of the side effect of losing wikiname associations
*refreshldap
[13:19]
MichaelDaumwhile it _is_ a cache atm, foswiki needs more of a db ... which LdapContrib would connect to prefetching records
that is: removing the need of a cache.db inside LdapContrib
[13:20]
jastyeah, makes sense [13:20]
MichaelDaumonly if there was one central user db, would you be able to resolve name clashes where login names come from various sources (ldap, topics, oauth, ...)
for security reasons such a user database needs a long-term memory as well
to prevent id takeover
[13:21]
jastabsolutely [13:26]
............................................... (idle for 3h51mn)
seven_9Hello. Does anybody know anything about this problem? I am having the same http://foswiki.2555947.n2.nabble.com/HELP-User-Registration-not-working-td7313464.html
(no js problems)
[17:17]
.............................. (idle for 2h25mn)
***seven_9 has left [19:43]
...... (idle for 29mn)
tsnfooDoes anybody know how to interface with strike one when you're generating client-side forms?
I can't find any good examples
[20:12]
........................ (idle for 1h58mn)
***ChanServ sets mode: +o Lynnwood [22:10]
.................... (idle for 1h36mn)
kornbluth.freenode.net sets mode: +oo gac410 Babar [23:46]
