Kerberos, AFS and Batch Systems
(This is an overview I wrote in 2004. I would suspect that a lot of the technology is no longer relevant, but this article still got some occasional hits on my old homepage so I’m migrating it just in case it’s still useful for some people.)
Kerberos is a system for securely authenticating users in an insecure network environment. It was developed in the 1980s at MIT as part of the famous Project Athena. During the 1990s Kerberos V5 was standardized in RFC 1510 and became widely used, especially after Microsoft decided to base Windows 2000 security on it. Within Kerberos, each user obtains a Ticket Granting Ticket (TGT), which can be used to acquire dedicated service tickets. A service ticket is then used to authenticate the user to that particular service.
AFS (the “Andrew File System”) started as a research project at Carnegie Mellon University. Using a slightly modified Kerberos V4 system, its developers built a secure network filesystem that spreads data across several servers and supports complex access control lists. Today AFS is used all over the world, especially by universities and other research institutions. From the Kerberos point of view, AFS is just a service: it requires the user to hold a valid Kerberos ticket for the service “AFS”. This ticket is often called the “AFS-Token”.
A Batch System controls resource usage on special machines such as a supercomputer or a group of compute servers. Users just define what their jobs need (e.g. the number of CPUs), and the Batch System decides if and when the user will get these resources. It then takes care of starting (and possibly killing) the job, and finally saves some information about the job for accounting or statistics. Several commercial and non-commercial Batch Systems are available these days; two important open source systems are Torque/OpenPBS and Sun’s Grid Engine.
This article is about the problems that arise when these technologies are used together. Imagine a user submits a job which uses files from his home directory in AFS. The Batch System then has to make sure that the job holds the user’s AFS-Token while running, or the job will not be able to access the files. Making these files readable by anyone would let the Batch System ignore AFS entirely, but this is not an option for security reasons. Another problem is the limited lifetime of the AFS-Token: it has to be prolonged or renewed so that the job can access AFS continuously.
- MIT’s Kerberos homepage
- OpenAFS homepage (AFS became a commercial product that was sold by Transarc Inc. during the 1990s. IBM bought Transarc in 1994 and released the AFS sources in 2000, which started the OpenAFS project.)
- Torque Resource Manager
- [Sun] Grid Engine
Kaserver / Kerberos V4 and AFS-Tokens
The Kaserver is a modified Kerberos V4 server used by the original AFS setup. It has some special features and speaks an additional network protocol (“Rx”) used only by AFS utilities. It is possible to use a Kerberos V4 server instead of the Kaserver (Using MIT’s Kerberos Server with AFS). The normal lifetime of an AFS-Token is configured per user on the server; usually it is about 25 hours. The Token is isolated from the rest of the system in a structure called a Process Authentication Group (PAG), which is identified by two unique group IDs.
Renewal of an AFS-Token
One way to provide a long-lasting job with a token is to acquire a fresh token from time to time. The process doing this usually needs access to the user’s password though, either by asking him for it during job submission or by requiring him to store it somewhere.
One tool for this task is the “Password Storage and Retrieval System” (PSR), which uses asymmetric cryptography to store the user’s password securely, in encrypted form, in his AFS space. When a job needs to acquire a new token, the password is decrypted and simply used to request a new token. The secret key needed to decrypt the password is stored only on the machine that runs the job. If the user changes his password and updates the encrypted storage too, the new password is automatically used on the next renewal.
Another way to store the password is to use a “SRVTAB” file. Such a file is normally used to store a server key, but it can also store the key of a user. The stored key is not the plaintext password but some kind of hash. This way the password itself is not revealed, but be aware: as far as Kerberos is concerned, the hash can be used just like the password. So when a job needs to acquire a new token, the hash can simply be used. You can find a description of this technique here: UMich: “How to run long lived jobs with AFS”
Some quick hints for use with KTH Kerberos V4. Create the SRVTAB like this:
    ksrvutil -f mysrvtab -c example.com add
where example.com is the name of your AFS cell. Enter your username when prompted for “Name:”, the name of your AFS cell in uppercase for “Realm:”, and just press enter for “Instance:” and “Version Number:”. It will then ask for a password twice; enter your normal AFS password. The created file “mysrvtab” can be used like this:
    kauth -n myname -f mysrvtab bash
where myname is your username. kauth will run the given command (here: bash) and repeatedly renew the AFS-Token by using the secret from mysrvtab.
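A batch system's job starter could wrap the user's command in the kauth call quoted above. The following is only a minimal sketch: the helper name wrap_with_kauth is hypothetical, and the only kauth options it uses are the ones shown in this article.

```python
from typing import List, Sequence


def wrap_with_kauth(user: str, srvtab: str, command: Sequence[str]) -> List[str]:
    """Prefix a job command with the kauth invocation quoted in the text.

    Hypothetical helper: it only builds the argument list, it does not run it.
    """
    if not command:
        raise ValueError("command must not be empty")
    return ["kauth", "-n", user, "-f", srvtab, *command]


argv = wrap_with_kauth("myname", "mysrvtab", ["bash"])
print(" ".join(argv))  # kauth -n myname -f mysrvtab bash
```

The resulting list could then be handed to something like subprocess.run(argv) on the execution host, assuming kauth and the SRVTAB file are available there.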
Prolongation of an AFS-Token
A completely different approach is to prolong AFS-Tokens, that is, to extend their lifetime without acquiring a new one.
To do this, one has to extract the Token from the current environment and decrypt it with the AFS-specific Kerberos service key (known only to the Kerberos server and the AFS fileservers). It is now possible to put a new timestamp into the Token, thereby extending its lifetime. After encrypting it with the service key again and putting it back into the user’s environment, the user has a Token with an extended lifetime. If this process is repeated regularly, the Token never expires.
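The decrypt, re-stamp, re-encrypt cycle can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: ToyToken is not the real ticket layout, the XOR routine is only a placeholder for the real cipher (and is not secure), and the key is made up. It only shows why knowing the service key is enough to extend a token.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ToyToken:
    """Illustrative stand-in for an AFS-Token; the real ticket layout differs."""
    user: str
    issued_at: int  # start timestamp (seconds since the epoch)
    lifetime: int   # validity period in seconds

    def expires_at(self) -> int:
        return self.issued_at + self.lifetime


def toy_crypt(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR 'cipher' (NOT secure). XOR is its own inverse,
    so the same function both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def seal(token: ToyToken, service_key: bytes) -> bytes:
    """Serialize the token and encrypt it with the service key."""
    return toy_crypt(json.dumps(asdict(token)).encode(), service_key)


def unseal(blob: bytes, service_key: bytes) -> ToyToken:
    """Decrypt a sealed token with the service key."""
    return ToyToken(**json.loads(toy_crypt(blob, service_key).decode()))


def prolong(blob: bytes, service_key: bytes, now: int) -> bytes:
    """Decrypt, stamp with a new start time, and encrypt again."""
    token = unseal(blob, service_key)
    return seal(replace(token, issued_at=now), service_key)


key = b"hypothetical-afs-service-key"
old = seal(ToyToken("alice", issued_at=0, lifetime=90000), key)  # 25 hours
new = unseal(prolong(old, key, now=80000), key)
print(new.expires_at())  # 170000
```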
To my knowledge this approach was first taken at CERN in the first half of the 1990s, where the programs GetToken, SetToken and forge were created. These programs became the basis of CERN’s “Authenticated Remote Control” (ARC) system, and some time later Codine and LoadLeveler gained support for these tools. (Codine is today known as Sun GridEngine, which still contains support for this method.)
A reimplementation of this method can be found in Mike Becher’s OpenPBS module extension.
I wrote yet another reimplementation, which does not rely on OpenAFS but uses only Kerberos and “krbafs”, as found on any Fedora Core 1 machine.
Kerberos V5 and AFS-Tokens
Kerberos V5 brought some new features, renewable TGTs being the most notable with regard to this article. Such a ticket can be renewed via kinit -R without entering the password again. During a renewal, the kinit command contacts the Kerberos server and asks whether the renewal is acceptable. This makes it possible to stop further usage of a stolen TGT, e.g. by disabling the account on the server. As another security measure, a renewable ticket not only contains the usual (short) lifetime specification, but also a (long) “renewable lifetime” that sets an upper limit for ticket renewal. Usually the normal lifetime is about a day, while the renewable lifetime can last for months.
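The interplay of the two lifetimes can be sketched as a small decision function plus a wrapper around kinit -R. This is a minimal sketch under assumptions: the function names, the ten-minute safety margin and the injectable run parameter are all hypothetical, and how the timestamps are obtained from the ticket cache is left out.

```python
import subprocess
from typing import Callable, Sequence


def should_renew(now: float, expires_at: float, renew_until: float,
                 margin: float = 600.0) -> bool:
    """True when the TGT is close to expiry but can still be renewed.

    expires_at is the end of the normal (short) lifetime, renew_until the
    end of the renewable lifetime. The margin is an assumed safety window.
    """
    if now >= expires_at or now >= renew_until:
        return False  # too late: a fresh TGT (and thus the password) is needed
    return expires_at - now <= margin


def _run_command(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def renew_tgt(run: Callable[[Sequence[str]], int] = _run_command) -> bool:
    """Ask the Kerberos server for a renewal via 'kinit -R'."""
    return run(["kinit", "-R"]) == 0
```

A shepherd process could call should_renew periodically and invoke renew_tgt when it returns True. The run parameter exists only so the wrapper can be exercised without a Kerberos installation.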
Creating AFS-Tokens out of Thin Air
A completely different way to provide jobs with AFS-Tokens is to fake them. This is easy if you know the service key of the AFS service. (Remember: AFS is a Kerberos service, and the AFS-Token is just a Kerberos V4 service ticket. Therefore it has a well-known structure and is encrypted with the AFS service key.) An implementation of this method is available as GSSKLOG, a tool which uses the GSS-API to authenticate a client to the server and hands back a faked AFS-Token if authentication succeeds.
Batch Systems: Required Steps for AFS-Integration
A complete solution would be to build full Kerberos and AFS support into the Batch System. But this would probably require considerable changes to network communication and internal structures, so a simpler way is preferable.
When a job is submitted (e.g. qsub)
- Kerberos V4 with Renewal: Ask the user for his password or check for some prepared storage. Possibly attach some information about it to the job.
- Kerberos V4 with Prolongation: Extract the user’s Token from the current PAG on the submit host and attach it to the job.
- Kerberos V5: Forward the TGT along with the job.
- Fake Tokens: The Batch System must be 100% sure of the user’s identity.
While the job is queued
- Kerberos V4: Do nothing.
- Kerberos V5: Renew the TGT repeatedly.
- Fake Tokens: Do nothing.
When the job is started
- Kerberos V4 with Renewal: Instantiate a PAG. Let Kerberos create a new TGT and a new AFS-Token.
- Kerberos V4 with Prolongation: Instantiate a PAG. Prolong the AFS-Token and insert it into the PAG.
- Kerberos V5: Instantiate a PAG. Let Kerberos create an AFS-Token (based on the TGT).
- Fake Tokens: Instantiate a PAG. Create a fake token and insert it into the PAG.
While the job is running
- Kerberos V4 with Renewal: Create a new TGT and a new AFS-Token repeatedly inside the PAG.
- Kerberos V4 with Prolongation: Repeatedly prolong the AFS-Token and insert it into the PAG.
- Kerberos V5: Renew the TGT repeatedly inside the PAG, let Kerberos create an AFS-Token each time.
- Fake Tokens: Repeatedly create a new fake token and insert it into the PAG.
When the job has finished
- Kerberos V4: Destroy all local knowledge about the user’s password and close the PAG.
- Kerberos V5: Close the PAG.
- Fake Tokens: Close the PAG.
To sum up, a Batch System that wishes to support AFS has to be able, at the very least, to:
- call an external program on the submitting host when a job is submitted
- call an external program regularly while a job is queued
- start processes through an external program (a PAG-shell)
- run an external “shepherd” program together with the user’s process OR call an external program regularly from its own “shepherd” process
- pass some information along with the job and make this information available to the external programs
Not all described mechanisms need all these hooks though.
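The hooks listed above can be pictured as a small interface that the Batch System calls at each stage of a job. The sketch below is purely illustrative: the class and method names are hypothetical, and the Kerberos V5 implementation only records the steps from the lists above instead of performing them.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Job = Dict[str, List[str]]  # stands for the information passed along with the job


class AfsHooks(ABC):
    """Hypothetical hook points a Batch System would have to offer."""

    @abstractmethod
    def on_submit(self, job: Job) -> None: ...

    @abstractmethod
    def while_queued(self, job: Job) -> None: ...

    @abstractmethod
    def on_start(self, job: Job) -> None: ...

    @abstractmethod
    def while_running(self, job: Job) -> None: ...

    @abstractmethod
    def on_finish(self, job: Job) -> None: ...


class Krb5Hooks(AfsHooks):
    """Records the Kerberos V5 steps rather than executing them."""

    def on_submit(self, job: Job) -> None:
        job["log"].append("forward TGT with job")

    def while_queued(self, job: Job) -> None:
        job["log"].append("renew TGT")

    def on_start(self, job: Job) -> None:
        job["log"].append("instantiate PAG, create AFS-Token from TGT")

    def while_running(self, job: Job) -> None:
        job["log"].append("renew TGT in PAG, create AFS-Token")

    def on_finish(self, job: Job) -> None:
        job["log"].append("close PAG")


def simulate(hooks: AfsHooks, queued_ticks: int, running_ticks: int) -> List[str]:
    """Drive the hooks in the order a Batch System would call them."""
    job: Job = {"log": []}
    hooks.on_submit(job)
    for _ in range(queued_ticks):
        hooks.while_queued(job)
    hooks.on_start(job)
    for _ in range(running_ticks):
        hooks.while_running(job)
    hooks.on_finish(job)
    return job["log"]


for step in simulate(Krb5Hooks(), queued_ticks=2, running_ticks=1):
    print(step)
```

A mechanism that needs fewer hooks, such as fake tokens while the job is queued, would simply leave the corresponding method empty.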
- MIT Longjobs: A patched version of OpenPBS which includes some Kerberos support to allow Kerberos authentication and long-running jobs that access AFS.
- Kerberos on Wall Street: Paper about Morgan Stanley’s migration from Kerberos V4 to V5; includes some remarks about AFS.
- Globus and AFS
(C)opyright by Karsten Petersen <firstname.lastname@example.org> in 2004