e'yo. General description of layout/goals/info/etc, and semi sortta api. That and aggregator of random ass crazy quotes should people get bored. [DISCLAIMER] This ain't the code. In other words, the actual design/code may be radically different, and this document probably will trail any major overhauls of the design/code (speaking from past experience). Updates welcome, as are suggestions and questions- please dig through all documentations in the dir this doc is in however, since there is a lot of info (both current and historical) related to it. Collapsing info into this doc is attempted, but explanation of the full restriction protocol (fex) is a _lot_ of info, and original idea is from previous redesign err... designs. Short version, historical, but still relevant info for restriction is in layout.txt. Other subsystems/design choices have their basis quite likely from other docs in this directory, so do your homework please :) [FURTHER DISCLAIMER] If this is implemented, which is being attempted, this is going to break backwards compatibility for existing python api. Writing a translation layer doesn't really seem possible either, since things are pretty radically different (imho) Sorry, but thems the breaks. Feel free to attempt it, but this docs author doesn't view it as a good use of time, since old api was useful for querying, about it. [Terminology] cp = category/package cpv = category/package-version ROOT = livefs merge point, fex /home/bharring/embedded/arm-target or more commonly, root=/ vdb = /var/db/pkg , installed packages database. domain = combination of repositories, root, and build information (use flags, cflags, etc). config data + repositories effectively repository = tree's. ebuild tree (/usr/portage), binpkg tree, vdb tree, etc. protocol = python name for design/api. iter() fex, is a protocol; it calls .next() on the passed in sequence, or wraps the sequence... hesitate to call it defined hook on a class/instance, but this (crappy) description should suffice. seq = sequence, lists/tuples set = list without order (think dict.keys()) [General design/idea/approach] So... this round of "lets take another stab at gutting portage" seems to be proceeding, and not killed off by design flaws *again* (famous last words I assure you), but general jist. All pythonic components installed by portage *must* be within portage.* namespace. No more polluting python's namespace, plain and simple. Third party plugins to portage aren't bound by this however (their mess, not ours). API flows from the config definitions, *everything* internal is effectively the same. Basically, config crap gives you your starter objects which from there, you dig deeper into the innards as needed action wise. The general design is intended to heavily abuse OOP, something portage has lacked thus far aside from the cache subsystem (cache was pretty much only subsystem that used inheritance rather then duplication). Further, delegation of actions down to components _must_ be abided by, example being repo + cache interaction. repo does what it can, but for searching the cache, let the cache do it. Assume what you're delegating to knows what the hell it's doing, and probably can do it's job better then some external caller (essentially). Actual configuration is pretty heavily redesigned. Look at conf_default_types (likely renamed at some point). It's the meta-definition of the config... sort of. Basically the global config class knows jack, is generic, and is configured by the conf_default_type. The intention is that the on disk configuration format is encapsulated, so it can be remote or completely different from the win.ini format advocated by harring (mainly cause it's easiest since a functional and mostly not completely obnoxious parser exists already). Encapsulation, extensibility/modularity, delegation, and allowing parallelizing of development should be key focuses in implementing/refining this high level design doc. Realize parallelizing is a funky statement, but it's apt; work on the repo implementations can proceed without being held up by cache work, and vice versa. Config is the central hold up, but that's mainly cause it instantiates everything, so the protocol for it needs to be nailed down so that you know what positiona/optional args your class/callable will be receiving (that said, positiona/optional is configurable via section definitions too). Final comment re: design goals, defining chunks of callable code and plugging it into the framework is another bit of a goal. Think twisted, just not quite as prevalent (their needs/focus is much different from ours, twisted is the app, your code is the lib, vice versa for portage). Back to config. Here's general notion of config 'chunks' of the subsystem, (these map out to run time objects unless otherwise stated) domain +-- profile (optional) +-- fetcher (default) +-- repositories +-- resolver (default) +-- build env data? | never actually instantiated, no object) \-- livefs_repo (merge target, non optional) repository +-- cache (optional) +-- fetcher (optional) +-- sync (optional) \-- sync cache (optional) profile +-- build env? +-- sets (system mainly). \-- visibility wrappers domain is configuration data, accept_(license|keywords), use, cflags, chost, features, etc. profile, dependant on the profile class you choose is either bound to a repository, or to user defined location on disk (/etc/portage/profile fex). Domain knows to do incremental crap upon profile settings, lifting package.* crap for visibility wrappers for repositories also. repositories is pretty straightforward. portdir, binpkg, vdb, etc. Back to domain. Domain's are your definition of pretty much what can be done. Can't do jack without a domain, period. Can have multiple domains also, and domains do *not* have to be local (remote domains being a different class type). Clarifying, think of 500 desktop boxes, and a master box that's responsible for managing them. Define an appropriate domain class, and appropriate repository classes, and have a config that holds the 500 domains (representing each box), and you can push updates out via standard api trickery. In other words, the magic is hidden away, just define remote classes that match defined class rules (preferably inheriting from the base class, since isinstance sanity checks will become the norm), and you could do emerge --domain some-remote-domain -u glsa on the master box. Emerge won't know it's doing remote crap. Portagelib won't even. It'll just load what you define in the config. Ambitious? Yeah, a bit. Thing to note, the remote class additions will exist outside of portage proper most likely. Develop the code needed in parallel to fleshing portage proper out. Meanwhile, the remote bit + multiple domains + class overrides in config definition is _explicitly_ for the reasons above. That and x-compile/embedded target building, which is a bit funkier. Currently, portage has DEPEND and RDEPEND. How do you know what needs be native to that box to build the package, what must be chost atoms? Literally, how do you know which atoms, say the toolchain, must be native vs what package's headers/libs must exist to build it? We need an additional metadata key, BDEPEND (build depends). If you have BDEPEND, you know what actually is ran locally in building a package, vs what headers/libs are required. Subtle difference, but BDEPEND would allow (with a sophisticated depresolver) toolchain to be represented in deps, rather then the current unstated dep approach profiles allow. Aside from that, BDEPEND could be used for x-compile via inter-domain deps; a ppc target domain on a x86 box would require BDEPEND from the default domain (x86). So... that's useful. So far, no one has shot this down, moreso, come up with reasons as to why it wouldn't work, the consensus thus far is mainly "err, don't want to add the deps, too much work". Regarding work, use indirection. virtual/toolchain-c <-- metapkg (glep37) that expands out (dependant on arch) into whatever is required to do building of c sources virtual/toolchain-c++ <-- same thing, just c++ virtual/autootols <-- take a guess. virtual/libc <-- this should be tagged into rdepends where applicable, packages that directly require it (compiled crap mainly) Yes it's extra work, but the metapkgs above should cover a large chunk of the tree, say >90%. [Config design] Portage thus far (<=2.0.51*) has had variable ROOT (livefs merge point), but no way to vary configuration data aside from via a buttload of env vars. Further, there has been only one repository allowed (overlays are just that, extensions of the 'master' repository). Addition of support of any new format is mildly insane due to hardcoding up the wing wang in the code, and extension/modification of existing formats (ebuild) has some issues (namely the doebuild block of code). Goal is to address all of this crap. Format agnosticism at the repository level is via an abstracted repository design that should supply generic inspection attributes to match other formats. Specialized searching is possible via match, thus extending the extensibility of the prototype repository design. Format agnosticism for building/merging is somewhat reliant on the repo, namely package abstraction, and abstraction of building/merging operations. On disk configurations for alternatives formats is extensible via changing section types, and plugging them into the domain definition. Note alt. formats quite likely will never be implemented in portage proper, that's kind of the domain of portage addons. In other words, dpkg/rpm/whatever quite likely won't be worked on by portage developers, at least not in the near future (too many other things to do). The intention is to generalize the framework so it's possible for others to do so if they choose however. Why is this good? Ebuild format has issues, as does our profile implementation. At some point, alternative formats/non-backwards compatible tweaks to the formats (ebuild or profile) will occur, and then people will be quite happy that the framework is generalized (seriously, nothing is lost from a proper abstracted design, and flexibility/power is gained). [config's actions/operation] portage.config.load_config() is the entrance point, returns to you a config object (portage.config.central). This object gives you access to the user defined configs, although only interest/poking at it should be to get a domain object from it. domain object is instantiated by config object via user defined configuration (/etc/portage/config namely). domains hold instantiated repositories, bind profile + user prefs (use/accept_keywords) together, and _should_ simplify this data into somewhat user friendly methods. (define this better). Normal/default domain doesn't know about other domains, nor give a damn. Embedded targets are domains, and _will_ need to know about the livefs domain (root=/), so buildplan creation/handling may need to be bound into domains. [Objects/subsystems/stuff] So... this is general naming of pretty much top level view of things, stuff emerge would be interested in (and would fool with). hesitate to call it a general api, but it probably will be as such, exempting any abstraction layer/api over all of this (good luck on that one }:] ). :IndexableSequence: functions as a set and dict, with caching and on the fly querying of info. mentioned due to use in repository and other places... (it's a useful lil sucker) This actually is misnamed. the order of iteration isn't necessarily reproducable, although it's usually constant. IOW, it's normally a sequence, but the class doesn't implicitly force it :LazyValDict: similar to ixseq, late loading of keys, on fly pulling of values as requested. :global config object (from portage.config.load_config()): #simple bugger. .get_types() # get the section types. .get_object() # return instantiated section .list_objects(type) # iterable, given a type, yield section names of that type. # and, just to make it damn fun, you can also access types via (fex, domain being a type) .domains # IndexableSequence, iterating == section labels, index == instantiate and return that section type convenience function in portage.config.* (somewhere) default_domain(config_obj) returns instantiated domain object of the default domain from config_obj. Nothing incredibly fancy, finds default domain via config._cparser.defaults().get("domain"), or via iterating over config.domain, returning the first domain that is root="/" what does the cparser bit map out to in the config? [DEFAULT] domain = some-section-label #the iterating route sucks, and will be a bit slower. default approach is recommended. :domain object: # bit of debate on this one I expect. # any package.{mask,unmask,keywords} mangling is instantiated as a wrapper around repository instances upon domain instantiation. # code _should_ be smart and lift any package.{mask,unmask,keywords} wrappers from repositoriy instances and collapse it, pointing # at the raw repo (basically don't have N wrappers, collapse it into a single wrapper). Not worth implementing until the wrapper is # a faster implementation then the current portage.repository.visibility hack though (currently O(N) for each pkg instance, N being # visibility restrictions/atoms). Once it's O(1), collapsing makes a bit more sense (can be done in parallel however). # a word on inter repository dependencies... simply put, if the repository only allows satisfying deps from the same repository, # the package instance's *DEPEND atom conversions should include that restriction. Same trickery for keeping ebuilds from depping on # rpm/dpkg (and vice versa). .repositories # in the air somewhat on this one. either indexablesequence, or a repositorySet. # nice aspect of the latter is you can just use .match with appropriate restrictions. very simply interface # imo, although should provide a way to pull individual repositories/labels of said repos from the set though. # basically, mangle a .raw_repo indexablesequence type trick (hackish, but nail it down when reach that bridge) .configure_package() # the contentious one. either a package instance returned from repositories is configured already, or it's returned # unconfigured, and passed through configure_package to get a configured_package. # it's slightly uglier for the resolver to have to do the .configure_package route, but note that the configuration # data _should not_ be bound to the repo (repo is just packages), it's bound to the domain. so... either that config # data is shoved down into some repo wrapper, or handled at this level. # that said, if it's at this level, it makes the .repositories.match() with use restrictions a bit funky in terms of # the "is it configured?" special casing... # debate and flesh this one out... :build plan creation: # Jason kindly chuck some details in here, lest harring makes a fool of himself trying to write out info on it himself... :sets: # Marius, kindly chuck in some details here. probably defined via user config and/or profile, although what's it define? # atoms/restrictions? itermatch might be useful for a true set :build/setup operation: (need a good name for this; dpkg/rpm/binpkg/ebuild's 'prepping' for livefs merge should all fall under this, with varying use of the hooks) .build() # do everything, calling all steps as needed .setup() # whatever tmp dirs required, create 'em. .req_files() # (fetchables, although not necessarily with url (restrict="fetch"...) .unpack() # guess. .configure() # unused till ebuild format version two (ya know, that overhaul we've been kicking around? :) .compile() # guess. .test() # guess. .install() # install to tmp location. may not be used dependant on the format. .finalize() # good to go. generate (jit?) contents/metadata attributes, or returns a finalized instance :repo change operation: # base class. .package # package instance of what the action is centering around. .start() # notify repo we're starting (locking mainly, although prerm/preinst hook also) .finish() # notify repo we're done. .run() # high level, calls whatever funcs needed. individual methods are mainly for ui, this is if you don't display # "doing install now... done... doing remove now... done" stuff. :remove operation: # derivative of repo change operation .remove() # guess. .package # package instance of what's being yanked. :install operation: # derivative of repo change operation .package # what's being installed. .install() # install it baby. :merge operation: # derivative of repo remove and install (so it has .remove and .install, which must be called in .install and .remove order) .replacing # package instance of what's being replaced. .package # what's being installed :fetchables: # basically a dict of stuff jammed together, just via attribute access (think c struct equiv) .filename .url # tuple/list of url's. .chksums # dict of chksum:val :fetcher: # hey hey. take a guess. # worth noting, if fetchable lacks .chksums["size"], it'll wipe any existing file. if size exists, and existing file is bigger, # wipe file, and start anew, otherwise resume. # mirror expansion occurs here, also. .fetch(fetchable, verifier=None) # if verifier handed in, does verification. :verifier: # note this is basically lifted conceptually from mirror_dist. if wondering about the need/use of it, look at that source. verify() # handed a fetchable, either False or True :repository: # this should be format agnostic, and hide any remote bits of it. # this is general info for using it, not designing a repository class .mergable() # true/false. pass a pkg to it, and it reports whether it can merge that or not. .livefs # boolean, indicative of whether or not it's a livefs target- this is useful for resolver, shop it to other repos, binpkg fex # prior to shopping it to the vdb for merging to the fs. Or merge to livefs, then binpkg it while continuing further building # dependant on that package (ui app's choice really). .raw_repo # either it weakref's self, or non-weakref refs another repo. why is this useful? visibility wrappers... # this gives ya a way to see if p.mask is blocking usable packages fex. useful for the UI, not too much for # portagelib innards. .frozen # boolean. basically, does it account for things changing without it's knowledge, or does it not. frozen=True is faster # for ebuild trees for example, single check for cache staleness. frozen=False is slower, and is what portage does now # (meaning every lookup of a package, and instantiation of a package instance requires mtime checks for staleness) .categories # IndexableSequence, if iterated over, gives ya all categories, if getitem lookup, sub-category category lookups. # think media/video/mplayer .packages # IndexableSequence, if iterated over, all package names. if getitem (with category as key), packages of that category. .versions # IndexableSequence, if iterated over, all cpvs. if getitem (with cat/pkg as key), versions for that cp .itermatch() # iterable, given an atom/restriction, yields matching package instances. .match() # def match(self, atom): return list(self.itermatch(atom)) # voila. .__iter__() # in other words, repository is iterable. yields package instances. .sync() # sync, if the repo swings that way. # flesh it out a bit, possibly handing in/back ui object for getting updates... digressing for a moment... note you can group repositories together, think portdir + portdir_overlay1 + portdir_overlay2. Creation of a repositoryset basically would involve passing multiple instantiating repo's, and depending on that classes semantics, it internally handles the stacking (right most positional arg repo overrides 2nd right most, ... overriding left most) So... stating it again/clearly if it ain't obvious, everything is configuration/instantiating of objects, chucked around/mangled by the portagelib framework. What _isn't_ obvious is that since a repository set gets handed instantiated repositories, each repo, _including_ the set instance, can should eb able to have it's own cache (this is assuming it's ebuild repos through and through). Why? Cache data doesn't change for the most part exempting which repo a cpv is from, and the eclass stacking. Handled individually, a cache bound to portdir _should_ be valid for portdir alone, it shouldn't carry data that is a result of eclass stacking from another overlay + that portdir. That's the business of the repositoryset. Consequence of this is that the repositoryset needs to basically reach down into the repository it's wrapping, get the pkg data, _then_ rerequest the keys from that ebuild with a different eclass stack. This would be a bit expensive, although once inherit is converted to a pythonic implementation (basically handing the path to the requested eclass down the pipes to ebuild*.sh), it should be possible to trigger a fork in the inherit, and note python side that multiple sets of metadata are going to be coming down the pipe. That should alleviate the cost a bit, but it also makes multiple levels of cache reflecting each repository instance a bit nastier to pull off till it's implemented. So... short version. Harring is a perfectionist, and says it should be this way. reality of the situation makes it a bit trickier. Anyone interested in attempting the mod, feel free, otherwise harring will take a crack at it since he's being anal about having it work in such a fashion. Or... could do thus. repo + cache as a layer, wrapped with a 'regen' layer that handles cache regeneration as required. Via that, would give the repositoryset a way to override and use it's own specialized class that ensures each repo gets what's proper for it's layer. Think raw_repo type trick. continuing on... :cache: # ebuild centric, although who knows (binpkg cache ain't insane ya know). # short version, it's functionally a dict, with sequence properties (iterating over all keys). .keys() # return every cpv/package in the db. .readonly # boolean. is it modifiable? .match() # flesh this out. either handed a metadata restriction (or set of 'em), or handed dict with equiv info (like the former). # ebuild caches most likely *should* return mtime information alongside, although maybe dependant on readonly. # purpose of this? Gives you a way to hand off metadata searching to the cache db, rather then the repo having to resort # to pulling each cpv from the cache and doing the check itself. # This is what will make rdbms cache backends finally stop sucking and seriously rocking, properly implemented at least. :) # clarification, you don't call this directly, repo.match delegates off to this for metadata only restrictions :package: # this is a wrapped, *constant* package. configured ebuild src, binpkg, vdb pkg, etc. # ebuild repositories don't exactly and return this- they return unconfigured pkgs, which I'm not going to go into right now # (domains only see this protocol, visibility wrappers see different) .depends # usual meaning. ctarget depends .rdepends # usual meaning. ctarget run time depends. seq, .bdepends # see ml discussion. chost depends, what's executed in building this (toolchain fex). seq. .files # get a better name for this. doesn't encompas files/*, but could be slipped in that way for remote. encompasses # restrict fetch (files with urls), and chksum data. seq. .description# usual meaning, although rememebr probably need a way to merge metadata.xml lond desc into the more mundane description key. .license # usual meaning, seq. .homepage # usual. needed? .setup() # name sucks. gets ya the setup operation, which does building/whatever. .data # raw data. may not exist, don't screw with it unless you know what it is, and know the instance's .data layout. :restriction: # see layout.txt for more fleshed out examples of the idea. # note, match and pmatch have been reversed namewise. .match() # handed package instance, will return bool of whether or not this restriction matches. .pmatch() # 'optimized' interface. basically, you hand it the data it tests, rather then a package instance. # basically, for a category restriction (fex), def match(self, p): return self.pmatch(p.category) # get the jist? it holds the actual test, if the restriction can be collapsed down to a test of a single value. # .pmatch calling should only be done by self, or code that does isinstance checks and hands off data itself rather then # doing an instantion. .pmatch is exposing a bit of the match protocol so that optimizations are potentially possible in # executing match checks (think repo.match). this one is still slightly up in the air. naming sucks also. .itermatch() # new one, debatable. short version, giving a sequence of package instances, yields true/false for them. # why might this be desirable? if setup of matching is expensive, this gives you a way to amoritize the cost. # jason/marius/alec, thoughts? # might have use for glsa set target. define a restriction that limits to installed pkgs, yay/nay if update is avail... :restrictionSet: # mentioning it merely cause it's a grouping (boolean and/or) of individual restrictions # an atom, which is in reality a category restriction, package restriction, and/or version restriction is a # boolean and set of restrictions :ContentsRestriction: # whats this you say? a restriction for searching the vdb's contents db? Perish the thought! ;) :metadataRestriction: Mentioning this for the sake of pointing out a subclass of it, DescriptionRestriction- this will be a class representing matching against description data. See repo.match and cache.match above. The short version is that it encapsulates the description search (a *very* slow search right now) so that repo.match can hand off to the cache (delegation), and the cache can do the search itself, however it sees fit. So... for the default cache, flat_list (19500 ebuilds == 19500 files to read for a full searchDesc), still is slow unless flat_list gets some desc. cache added to it internally. If it's a sql based cache, the sql_template should translate the query into the appropriate select statement, which should make it *much* faster. Restating that, delegation is *absolutely* required. There have been requests to add intermediate caches to the tree, or move data (whether collapsing metadata.xml or moving data out of ebuilds) so that the form it is stored is in quicker to search. These approaches are wrong. Should be clear from above that a repository can, and likely will be remote on some boxes. Such a shift of metadata does nothing but make repository implementations that harder, and shift power away from what knows best how to use it. Delegation is a massively more powerful approach, allowing for more extensibility, flexibility and *speed*. Final restating- searchDesc is matching against cache data. The cache (whether flat_list, anydbm, sqlite, or a remote sql based cache) is the *authority* about the fastest way to do searches of it's data. Programmers get pist off when users try and tell them how something internally should be implemented- it's fundamentally the same scenario. The cache class the user chooses knows how to do it's job the best, provide methods of handing control down to it, and let it do it's job (delegation). Otherwise you've got a backseat driver situation, which doesn't let those in the know, do the deciding (cache knows, repo doesn't). Mind you not trying to be harsh here. If in reading through the full doc you disagree, question it; if after speeding up current cache implementation, note that any such change must be backwards compatible, and not screw up the possibilities of encapsulation/delegation this design aims for. :logging: flesh this out (define this basically). short version, no more writemsg type trickery, use a proper logging framework. [ebuild-daemon.sh] Hardcoded paths *have* to go. /usr/lib/portage/bin == kill it. Upon initial loadup of ebuild.sh, dump the default/base path down to the daemon, *including* a setting for /usr/lib/portage/bin . Likely declare -xr it, then load the actual ebuild*.sh libs. Backwards compatibility for that is thus, ebuild.sh defines the var itself in global scope if it's undefined. Semblence of backwards compatibility (which is actually somewhat pointless since I'm about to blow it out of the water). Ebuild-daemon.sh needs a function for dumping a _large_ amount of data into bash, more then just a line or two. For the ultra paranoid, we load up eclasses, ebuilds, profile.bashrc's into python side, pipe that to gpg for verification, then pipe that data straight into bash. No race condition possible for files used/transferred in this manner. A thought. The screw around speed up hack preload_eclasses added in ebd's heyday of making it as fast as possible would be one route; Basically, after verification of an elib/eclass, preload the eclass into a func in the bash env. and declare -r the func after the fork. This protects the func from being screwed with, and gives a way to (at least per ebd instance) cache the verified bash code in memory. It could work surprisingly enough (the preload_eclass command already works), and probably be fairly fast versus the alternative. So... the race condition probably can be flat out killed off without massive issues. Still leaves a race for perms on any files/*, but neh. A) That stuff shouldn't be executed, B) security is good, but we can't cover every possibility (we can try, but dimishing returns) A lesser, but still tough version of this is to use the indirection for actual sourcing to get paths instead. No EBUILD_PATH, query python side for the path, which returns either '' (which ebd interprets as "err, something is whacked, time to scream"), or the actual path. In terms of timing, gpg verification of ebuilds probably should occur prior to even spawning ebd.sh. profile, eclass, and elib sourcing should use this technique to do on the fly verification though. Object interaction for that one is going to be *really* fun, as will be mapping config settings to instantiation of objs.