Some delete operations require more than a simple DELETE query; e.g. for image fields, a file needs to be removed from the filesystem. For individual object delete operations (e.g. field_attach_delete()), field data is removed from the database immediately, the field type's delete hook is invoked immediately, and any module's hook_field_attach_delete() implementation is invoked immediately.

For bulk operations, this doesn't work. Deleting a field instance, a bundle, or an entire field can result in an enormous number of required delete operations. As explained above, these delete operations may require more than a simple DELETE query and thus could potentially take a long time (for example, longer than the PHP execution time limit). Perhaps we could make delete operations "multiple" like loads, but there might be as many as O(10^7) individual entities affected, and we do not want to build a single in-memory array that large.

Therefore, field config, instance, and data tables now have a "deleted" column to mark data as deleted. When bulk delete operations are performed (e.g. delete an instance, field, or bundle), Field API simply marks all the affected rows as deleted without actually deleting them. field_attach_load() ignores all data marked deleted, so it appears to users of the Field API that the data is actually gone.
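To sketch the idea (the function name below is hypothetical; db_update() is the Drupal 7 database API, and the table naming follows the field_sql_storage convention), marking an instance's data deleted is one cheap UPDATE rather than per-row work:

```php
<?php
// Hypothetical sketch: bulk-"delete" all data for one field instance by
// flipping the deleted flag, instead of per-row deletes plus hook calls.
function example_mark_instance_data_deleted($field, $instance) {
  db_update('field_data_' . $field['field_name'])
    ->fields(array('deleted' => 1))
    ->condition('bundle', $instance['bundle'])
    ->execute();
  // field_attach_load() filters out deleted rows, so the data now looks
  // gone to Field API consumers even though no delete hooks have run.
}
```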

However, the data is not gone, and the various delete operations have not yet been invoked. We need a batched delete process to run later. This issue is for implementing the batch process.
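The rough shape is a cron hook that purges a bounded batch per run (field_cron() and field_purge() are named later in this issue; the batch-size parameter here is an assumption, not settled API):

```php
<?php
// Illustrative sketch only: on each cron run, purge a small batch of
// marked-deleted field data so no single run risks exceeding the PHP
// execution time limit.
function field_cron() {
  // Process at most 10 objects' worth of deleted field data per run.
  field_purge(10);
}
```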

I thought about this earlier this month and found some complexities. At first we were using the deleted column for *all* deletions, not just bulk deletions. However, we were also (e.g.) invoking hook_field_attach_delete() for individual deletes. This led to a conflict: during the (as yet unimplemented) batch delete, we would not know whether a deleted row was the result of an individual or bulk delete, and thus whether hook_field_attach_delete() had already been called for it, so we risked calling it twice.

I also realized that, as specified, hook_field_attach_delete() accepts the object being deleted. This implies that the object still needs to *exist* at the time the batch-delete GC takes place. However, node.module, for example, does not implement a delayed delete, so when a field is removed from a node and the field batch delete runs later, the node object is no longer available to load and pass to hook_field_attach_delete()! So it may prove to be the case that making Field API really support batch delete requires all of core to switch to this model.

So, currently, Field API just marks bulk-deleted data as deleted but does not actually delete it, ever. You might think this is a problem, but actually Field API is doing better than (e.g.) node module which, when bulk deletes occur, simply orphans a whole bunch of data in the database!

The bottom line is Drupal has a lot of work to do to get this exactly right.


Comments

bjaspan’s picture

When we get this sorted out (or sooner), we need to document all of this.

David_Rothstein’s picture

Subscribe.

The field deletion problem makes my head hurt every time I think about it :)

andrewlevine’s picture

I can't claim to completely grok FieldAPI so please correct me where I'm wrong.

Browsing through the code, I noticed that every field_attach_* API function and every hook_field_storage_* implementation requires the use of field_attach_extract_ids() to retrieve structured data from the amorphous $object argument. This makes sense because they need $id, $vid and $bundle to be useful, but it raises the question of why these are not formally part of the API (in the function definitions) to begin with.

I might be missing something, but it seems to me that if we can break this required "key data" out of the amorphous $object argument and into the function definitions, we could avoid this problem by not including the $object argument at all in the functions that do not insert/update data (_load, _load_revision, _delete, etc.).

For example, it might work something like this:
(ignoring the multiple load problem for now)
function field_attach_load($bundle, $obj_type, $id, $vid = NULL, $cacheable = TRUE);
function field_attach_delete($bundle, $obj_type, $id, $vid = NULL, $cacheable = TRUE);
function field_attach_insert($bundle, $obj_type, $object, $id, $vid = NULL, $cacheable = TRUE);
function field_attach_update($bundle, $obj_type, $object, $id, $vid = NULL, $cacheable = TRUE);

So now we only use $object for "real" data, not as a structure that also holds keys used for internal use.
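A hypothetical call under this scheme, just to show the shape (the variable names are invented), would let us delete a node's field data knowing only its keys:

```php
<?php
// Hypothetical usage of the proposed signature: delete field data for a
// node without loading (or even still having) the $object itself.
field_attach_delete('article', 'node', $nid, $vid);
```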

This change would have these positive effects:
1. The API is much clearer about what is required in the different field_attach_* functions (no guessing about what needs to be in an $object).
2. The hook_field_attach functions that don't insert/update data are intentionally prevented from relying on $object data that may or may not be available in any given circumstance (for example, it would prevent the author of a field module from depending on a property of $object that is provided by another field, which an administrator could remove at any time).
3. hook_fieldable_info() doesn't feel like a normal Drupal hook when it is called time after time from field_attach_extract_ids() just to parse a structured object. Since we would no longer have to use it to describe the structured data in $object, it could be used to expose data as one would expect (e.g. the human-readable name and bundles keys).
4. We would then be able to delete field data normally when the object doesn't exist. Actually, we would be able to delete the field even if the row in field_config didn't exist, since we'd just need to pass in the "key data" that is saved in field_config_instance and the field_data_X tables without $object_loading or field_attach_loading anything.
5. The delete of a single object could then be implemented without touching any fields at all (until bulk delete). For example, say we're Ubuntu and we have distro-release nodes that contain DVD ISOs. A node_delete of a distro-release would simply do the regular node_delete stuff and mark the fields deleted. Then cron comes back later and deletes the ISO whose removal might otherwise have caused the delete UI to stall for multiple seconds.

Does this make any sense? Am I missing something that requires us to use $object?

bjaspan’s picture

Status: Active » Needs work

I'm attaching some very much work-in-progress code, then suspending the issue until we completely revamp one aspect of Field API.

I've been struggling with how to implement the bulk deletion cleanup code on and off for months (in fact I've tortured several interviewees at Acquia with the problem hoping they'd figure out something I was missing!). Here are some of the questions and answers I've come up with:

* How do we determine the object ids that have marked-deleted data? Well, we can SELECT WHERE deleted = 1, but only the storage engine knows how to do that. So we need another hook into the storage engine. In my patch, I use field_attach_query() for that, by making it allow 'deleted' to be a condition.

* Cleaning up properly requires calling the same hooks that would normally be called on field_attach_delete(). But those hooks all expect an $object, and Field API has no way to create (for example) a node object that needs to be cleaned up, because (a) it can't call node_load() (not every object type has a common load method (yet)), and (b) even if it could, the node might have been deleted and not exist anymore! Solution: the realization that we do not need a full object, we just need an $object with the Field-API-relevant properties. The pseudo-objects created by the field_attach_query() patch meet these criteria perfectly.

* In order to properly clean up, a hook_field_delete() may need all of the data that the field would normally receive on field_attach_load(). For example, filefield needs the file path that is stored in the database so it knows what file to delete. So, $object->filefield[0]['filepath'] needs to be set. Solution: field_attach_query(FIELD_QUERY_RETURN_VALUES).

Clearly, the field_attach_query() patch was a real breakthrough for this. But I've hit a snag. Consider this excerpt in which I retrieve all deleted instances and then, one at a time, query for all the pseudo-objects with marked-deleted data in that field:

function field_purge() {
  // Retrieve all deleted field instances.
  $instances = field_read_instances(array('deleted' => 1), array('include_deleted' => 1, 'include_inactive' => 1));

  foreach ($instances as $instance) {

    // Retrieve some pseudo-objects along with field data for the
    // instance.
    //
    // TODO: We can't query by $instance['field_name'] because the
    // field and/or instance has been deleted and another with the
    // same name may exist.  We have to query by field or instance id,
    // but the entire Field API is based on field name.  Now we need
    // to change that.
    $obj_types = field_attach_query($instance['field_name'], array('bundle' => $instance['bundle'], 'deleted' => 1), 10, FIELD_QUERY_RETURN_VALUES);

The problem is that $instance['field_name'] does not uniquely identify the deleted instance. As committed in #364620: Allow creating a field with a deleted field's name, deleted instances (and fields) have to be identified by their id. But field_attach_query(), along with all the field_attach functions and a bunch of other APIs, operates in terms of field name, or instance, which is field name plus bundle. So that has to change before this issue can proceed. Also, I think it is the right thing to do; I've never liked that Field API is defined in terms of field names anyway.

So, see #484142: Stop using field name in the Field API, and we'll be back when that is closed. :-)

bjaspan’s picture

Oh, here is my patch. It is not close to done. In fact, I haven't even tried to run it, so it probably contains syntax errors, really dumb mistakes, partial ideas in progress, etc.

bjaspan’s picture

Assigned: Unassigned » bjaspan

I'm working on this.

yched’s picture

No time tonight to delve into this, just a quick note that our initial use case ("e.g. for image fields, a file needs to be removed from the filesystem" - timeout issues ensue) is actually flawed. My bad, I'm the one who brought it to the table at the FiC sprint :-(. I think filefield D6 was quite a bit in flux back then...

CCK D6 is particularly broken on the question of 'notifying field types of values being deleted', and this in fact doesn't stop imagefield/filefield from working, because filefield is in fact a 'file reference' field. Removing a reference to a file from a node doesn't mean you want to delete the file, and the same file can be referenced in different objects and/or in different fields. Filefield actually handles its own reference counts.

Just editing a node and changing a field value from 'a' to 'b' "deletes" value 'a', and we currently don't have any mechanism for field types to react to this - and AFAIK this has not stopped the existence of rich D6 field types. Deleting a field "deletes" a lot of values, but is it structurally different?

I'm not saying this invalidates the need for bulk deletion - maybe we do need it and my point is moot, but given the costly implications on the code and APIs, we might want to take a step back and reconsider:
Are things really bad if we cannot guarantee field types that they'll be notified whenever a "value" gets out of the system?
It might be an issue if a "value" is an actual resource, but possibly not if it's a *reference* to a resource, and those resources have their own lifecycle. And a "foobar"-field needs to be implemented the 'reference' way if it wants to address the "I want to reuse existing foobars" common request anyway.

It's late, the above is quite possibly cryptic or irrelevant or both, sorry about that.

bjaspan’s picture

re #7: I'm not sure how being a reference-counted thing addresses the problem. If a filefield is a reference to a file, and a filefield field is deleted, then a whole bunch of files just lost one reference. The file manager needs to know that; it needs to get the same hook_field_delete() (or equivalent) that it would get if fielded objects were deleted one at a time.

The same would be true for text fields changing from 'a' to 'b' if in fact the text values were stored somewhere other than the database column. So yes, this is all about "external resources."

Incidentally, now that field_attach_query() is in (and once it gets a few tweaks :-), I do not think that bulk deletion is going to be a tremendous amount more code or complexity. Figuring it out was hard, but doing it should be not so bad.

yched’s picture

OK, I delved a little more into D6 filefield, and its reference system (which is quite similar to the one in D7) doesn't handle the 'delete field' or 'delete bundle' cases (stale files). Pondered this a little, but I have to withdraw and apologize for the noise: I currently see no way to avoid 'delayed bulk delete' in Field API...

bjaspan’s picture

Current patch. Still not close to done, but this time it actually runs and the very beginnings of a deleted-data purge are actually occurring. Comments welcome but not required yet.

bjaspan’s picture

Status: Needs work » Needs review
FileSize
43.12 KB

This patch is now worth a review, though it is still not quite done. The theory of operation is explained in a PHPdoc comment in field.crud.inc. As I've said previously, the problems solved with field_attach_query() made this a fairly straightforward patch. Things I know that are not done:

* There is no field_cron() to actually trigger the whole field_purge() system. Drupal really has no good way (that I know of) to control time limits for processes like these.
* The tests are incomplete. We at least need to test that: non-deleted instances are unaffected by purging; field_sql_storage actually purges data; pbs.module is capable of implementing this system; probably more.
* There are no new hook docs in field.api.inc; I'm waiting for the interfaces to be stable.
* There are several TODOs in the code suggesting changes to existing APIs.

I did this with a minimum of converting the Field API to use field ids instead of names. There is a certain amount of inelegance as a result, but it does make the total set of changes much smaller.

Status: Needs review » Needs work

The last submitted patch failed testing.

bjaspan’s picture

Status: Needs work » Needs review
FileSize
43.12 KB

Hmmm. I fixed the bug that caused those field form test failures, and they pass on my system. Attaching a new patch in case I didn't re-run cvs diff last time... but I think I did.

Status: Needs review » Needs work

The last submitted patch failed testing.

yched’s picture

re #11: "pbs.module is capable of implementing this system".
Does pbs need to do anything on purge?
As a 'secondary storage', it seems it can simply delete the rows (object delete) or columns (instance/field delete) immediately without bothering about notifications and delayed purge?

I started to review the patch, but got caught by the clock. Looking great! Here are my remarks so far (mainly minor points, I didn't start to look at the meaty places...):

- FieldBulkDeleteTestCase::setUp()
Couldn't we use the "API" function field_test_create_bundle() instead of variable_set('field_test_bundles', $this->bundles)? I guess it has to do with the odd location of the parent::setUp() call?

- double spacing in PHPdocs for field_info_field(), field_info_field_by_id(), field_attach_query()

- _field_invoke_multiple(): $deleted = isset($options['deleted']) ? $options['deleted'] : 0;
Could be moved to the // Merge default options. section at the beginning of the function.
+ Shouldn't this 'deleted' option be added to field_invoke() too, for consistency?

- field_attach_query(): hook_field_attach_pre_query() should receive the $limit param too?

yched’s picture

Delved a little more into the juicy bits of the code.
As you noted in a couple of TODOs, I suspect some purge hooks could be refactored or merged with existing hooks. We can see that when the code settles.

- _field_info_collate_fields() collects information about 'deleted' fields (in the field_ids key) but not about deleted instances? Is that not needed?

- field_attach_query($limit): field_sql_storage_field_storage_query() queries on *values*, not full objects, and the results are currently sorted by delta only, so $query->range(0, $limit); will produce strange results (all delta-0 values, truncated objects...).
Maybe it's in fact OK for the purge stuff (which mainly operates on values rather than objects), but it could produce very surprising behavior with other uses of field_attach_query().

- do we need both hook_field_storage_delete() and hook_field_storage_purge()?
In field_sql_storage.module at least, one calls the other.
+ (minor) having field_sql_storage_field_storage_purge() used both as a hook implementation and as a helper function called directly by field_sql_storage_field_storage_delete() is a little confusing when maintaining or debugging code. Some other parts of core do this, and it always makes me twitch.

- in the PHPdoc below @defgroup field_purge, lines wrap at less than 80 chars.
+ double spacing

- field_purge():

  $fields = field_read_fields(array('deleted' => 1), array('include_deleted' => 1));
  foreach ($fields as $field) {
    if (!isset($field['bundles']) || count($field['bundles']) == 0) {

I don't think field_read_fields() returns $field['bundles']? That information is added in field_info_collate_fields().

- field_purge_data(): do we need to clear the field cache at the end? I'd think the cache does not include deleted-but-not-purged-yet values, so it's not stale after the actual purge?

- Similarly, field_cache_clear() in field_purge_instance() and field_purge_field() is a bit tough. We need to clean the cached 'field info', since it contains information about deleted fields, but the actual data cache could persist. Our field_cache_clear() is probably not granular enough...

- The newly introduced $options['deleted'] in _field_invoke_multiple() is currently used nowhere?

bjaspan’s picture

Status: Needs work » Needs review
FileSize
46.31 KB

This patch is definitely still not done, but the reviews are very helpful.

re #11: "pbs.module is capable of implementing this system".
Does pbs need to do anything on purge?
As a 'secondary storage', it seems it can simply delete the rows (object delete) or columns (instance/field delete) immediately without bothering about notifications and delayed purge?

Whether pbs.module is primary or secondary storage remains to be seen. I may end up needing to write d6-style-storage-for-eaton.module.

- FieldBulkDeleteTestCase::setUp()
Couldn't we use the "API" function field_test_create_bundle() instead of variable_set('field_test_bundles', $this->bundles)?

Fixed.

- double spacing in PHPdocs for field_info_field(), field_info_field_by_id(), field_attach_query()

Fixed.

- _field_invoke_multiple(): $deleted = isset($options['deleted']) ? $options['deleted'] : 0;
Could be moved to the // Merge default options. section at the beginning of the function.

Fixed.

+ Shouldn't this 'deleted' option be added to field_invoke() too, for consistency?

Fixed.

- field_attach_query(): hook_field_attach_pre_query() should receive the $limit param too?

Fixed.

As you noted in a couple of TODOs, I suspect some purge hooks could be refactored or merged with existing hooks. We can see that when the code settles.

Yes. As I tried but may have failed to explain, there is a real difference between delete and purge: purge operates on a single field; delete operates on all fields for a bundle. Two of the three delete hooks (not hook_field_delete()) are logically per-object, so separate delete and purge functions are needed. I'm not sure what the best DX is here.

- _field_info_collate_fields() collects information about 'deleted' fields (in the field_ids key) but not about deleted instances? Is that not needed?

Not yet needed, apparently. Easy to add when we do need it.

- field_attach_query($limit): field_sql_storage_field_storage_query() queries on *values*, not full objects, and the results are currently sorted by delta only, so $query->range(0, $limit); will produce strange results (all 0 delta values, truncated objects...).

Excellent point, and this DOES matter for purge, too. Suppose a filefield has 3 deltas. filefield_field_purge() needs to get all three so it can update the refcounts on all three.

I think f_a_query() does need to guarantee that it returns only complete pseudo-objects when RETURN_VALUES and $limit are passed. Design for field_sql_storage: sort by entity type/id and delta, not just delta, then throw away the possibly incomplete object it gets last. Gotcha: a field with 50 deltas on an object with $limit of 10 will never get returned. Solution: if f_a_query() has $limit and gets no complete objects to return, it issues a second SQL query with LIMIT offset, count until it gets a complete object. We all knew that f_a_query() would get complicated.

I point out that the $limit parameter is really mandatory even separate from bulk delete, because otherwise f_a_query() is too risky to call; it is always possible it will consume all memory and take down the server. That's why I made it not a default argument; if you pass QUERY_NO_LIMIT, you *better* know what you are doing (and almost by definition, if you pass that symbol you do *not* know what you are doing; we probably should not even offer it).

- do we need hook_field_storage_purge()?
In field_sql_storage.module at least, one calls the other.

Yes, see above.

+ (minor) having field_sql_storage_field_storage_purge() used both as a hook implementation and as a helper function called directly by field_sql_storage_field_storage_delete() is a little confusing when maintaining or debugging code.

Well, purge is for one field, delete is for all fields; the former is naturally a helper for the latter.

- in PHPdoc below @defgroup field_purge wraps to less than 80 chars.
+ double spacing

Fixed.

- field_purge

  $fields = field_read_fields(array('deleted' => 1), array('include_deleted' => 1));
  foreach ($fields as $field) {
    if (!isset($field['bundles']) || count($field['bundles']) == 0) {

I don't think field_read_fields() returns $field['bundles']?

Fixed.

- field_purge_data(): do we need to clear the field cache at the end? I'd think the cache does not include deleted-but-not-purged-yet values, so it's not stale after the actual purge?

Indeed, marked-as-deleted data had better already be removed from the cache or we might return it! I removed the cache clear from purge.

- Similarly, field_purge_field() is a bit tough. We need to clean the cached 'field info', since it contains information about deleted fields, but the actual data cache could persist. Our field_cache_clear() is probably not granular enough...

I added _field_info_cache_clear().

- The newly introduced $options['deleted'] in _field_invoke_multiple() is currently used nowhere?

It is used to control how instances to operate on are retrieved:

    if ($options['deleted']) {
      $instances = field_read_field(array('bundle' => $bundle, array('include_deleted' => $options['deleted'])));
    }
    else {
      $instances = field_info_instances($bundle);
    }

HMMM. That should be field_read_instances(), not field_read_fields(), eh? So how the frack are my tests passing? Must look into that. :-)

Status: Needs review » Needs work

The last submitted patch failed testing.

yched’s picture

"Solution: if f_a_query() has $limit and gets no complete objects to return, it issues a second SQL query with LIMIT offset, count until it gets a complete object"

Agreed, but how do we know we didn't fetch all the values for the object?
Is it problematic if the query has no limit and we ensure we only fetch $limit objects in a PHP loop that retrieves individual rows? I'm not sure it solves the memory/performance issues.
Other than that, I completely agree f_a_query() requires a $limit param. Question: does it also mean it needs an $offset param?

"- The newly introduced $options['deleted'] in _field_invoke_multiple() is currently used nowhere?
- It is used to control how instances to operate on are retrieved."

Sorry, I meant I see no actual call to _field_invoke_*() that makes use of it - which might explain why a bug in there doesn't affect the tests :-)

bjaspan’s picture

"Solution: if f_a_query() has $limit and gets no complete objects to return, it issues a second SQL query with LIMIT offset, count until it gets a complete object"

Agreed, but how do we know we didn't fetch all the values for the object?

I thought about this for a minute and realized, wait: why aren't we just using f_a_load() to get all the values for all of the objects returned by the query? I looked at the f_a_query() code and discovered... we are. Then I looked again at field_sql_storage_field_storage_query() and see that when it gets FIELD_QUERY_RETURN_VALUES, it is doing more work to retrieve and assemble all the data. Am I confused, or is that wasted work because f_a_query() is just going to f_a_load() the data anyway?

ISTM that hook_field_storage_query() should only return object ids, and f_a_load() should be used to retrieve the data. I guess that might mean hook_field_storage_query() is throwing away data it has handy, to be loaded again in f_a_load(). But it does nicely solve the confusion with $limit for f_a_query().

Is it problematic if the query has no limit and we ensure we only fetch $limit objects in a PHP loop that retrieves individual rows? I'm not sure if it solves the memory/performance issues.

My understanding is that that does not help. The mysql server always pushes all returned results to the client, which buffers them. The mysql docs I've read are clear on this.

Other than that, I completely agree f_a_query() requires a $limit param. Question: does it also mean it needs an $offset param?

I knew someone would ask this. Probably, yes. $limit could be $range, either an int (limit) or array pair (range).

yched’s picture

"why aren't we just using f_a_load() to get all the values for all of the objects returned by the query? I looked at the f_a_query() code and discovered... we are"

Not exactly, we're only calling hook_field_load() and hook_field_attach_load() on the values returned by hook_field_storage_query().
That is, the main difference between f_a_load() and f_a_query() is that hook_field_storage_query() is used instead of hook_field_storage_load() to provide the stored values.

"hook_field_storage_query() should only return object ids, and f_a_load() should be used to retrieve the data"

The difference is that hook_field_storage_query() only returns the subset of values that match the conditions.
I'm not completely sure whether that's a limitation or a feature...

bjaspan’s picture

On whether f_a_query() should return complete pseudo-objects or just certain values, I observe that our docs say:

/**
 * Retrieve objects matching a given set of conditions.
 */
function field_attach_query($field_id, $conditions, $limit, $result_format = FIELD_QUERY_RETURN_IDS, $age = FIELD_LOAD_CURRENT) {

"Retrieve objects."

Hmmm. Perhaps I do not need to care about whether f_a_query() returns values or objects; I can ask it to return only ids, and then use f_a_load() explicitly to ensure I get complete object data. This also makes the $limit parameter make sense. However, it also means that f_a_load() needs an option to allow loading deleted fields.
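That two-step approach might look like the sketch below. Caveat: the return shape of FIELD_QUERY_RETURN_IDS and the option name for loading deleted data are assumptions about where the patch is heading, not settled API.

```php
<?php
// Hypothetical two-step purge fetch: ask f_a_query() for ids only, then
// load complete pseudo-objects with f_a_load(). The 'include_deleted'
// load option is the missing piece noted above; its name is illustrative.
$results = field_attach_query($field_id, array('deleted' => 1), 10, FIELD_QUERY_RETURN_IDS);
foreach ($results as $obj_type => $stub_objects) {
  field_attach_load($obj_type, $stub_objects, FIELD_LOAD_CURRENT, array('include_deleted' => TRUE));
}
```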

I should stop ignoring this meeting I'm attending now...

yched’s picture

About the PHPdoc: hm, true, but I'd say it's rather the one-liner that is inaccurate ;-). The rest of the PHPdoc clearly states 'not all values'.

"Perhaps I do not need to care about whether f_a_query() returns values or objects; I can ask it to return only ids, and then use f_a_load()"
Well, this sidesteps the issue for the bulk delete case, but leaves the question of $limit problematic for FIELD_QUERY_RETURN_VALUES.

Taking a step back. So, we need $limit. But then we also need $offset, because there's no way to get all results without it - unless the operation done on a bunch of results deletes the corresponding rows, in which case you always use offset 0. That's the case for bulk delete, but not for other f_a_query() use cases.

The problem is that:
- for the code calling f_a_query(), the only unit that makes sense for $limit/$offset is 'full objects' (# of objects to retrieve, # of objects to skip). $limit in 'values/db rows' is unpredictable: it leads to incomplete objects, and you can get n < $limit results while there are still results to be loaded.
- but the SQL query built in field_sql_storage_field_storage_query() is only able to use 'db rows' as a unit.

The following logic for field_sql_storage_field_storage_query($limit) should ensure it actually returns $limit (or less) full objects, whatever the result mode (return ids or return values):

$query = "the query, without range for now, sorted by obj_id and delta"
$offset = 0;
while (the previous iteration returned $limit rows) {
  $query->range($offset, $limit);
  $results = $query->execute(); 
  foreach ($results as $row) {
    $offset++;
    if (we hit ($limit + 1) different objects) {
      break the while loop;
    }
    put the row into our $results array;
  }    
}
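To make the loop above concrete, here is a self-contained simulation in plain PHP: $all_rows stands in for the SQL result sorted by entity id and then delta, array_slice() simulates $query->range(), and the function name and row format are invented for illustration.

```php
<?php
// Invented helper: return at most $limit *complete* objects from a result
// set sorted by entity_id then delta, plus the offset to resume from.
function example_fetch_complete_objects(array $all_rows, $limit, $offset = 0) {
  $objects = array();
  do {
    // Simulates $query->range($offset, $limit)->execute().
    $rows = array_slice($all_rows, $offset, $limit);
    foreach ($rows as $row) {
      if (!isset($objects[$row['entity_id']]) && count($objects) == $limit) {
        // This row starts the ($limit + 1)th distinct object: stop here,
        // leaving $offset pointing at the start of the next object.
        return array($objects, $offset);
      }
      $objects[$row['entity_id']][$row['delta']] = $row['value'];
      $offset++;
    }
  } while (count($rows) == $limit);
  // Fewer than $limit rows came back: the result set is exhausted.
  return array($objects, $offset);
}
```

The caller gets back at most $limit complete objects and the offset to pass in to fetch the next bunch.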

For the $offset, the only way I see right now is that f_a_query() takes $offset by reference, and field_sql_storage_field_storage_query() changes it into 'the value that you need to provide if you want the next bunch of results'.
$limit is in 'objects', $offset is in 'whatever makes sense for the storage engine' (you'll usually start at 0 anyway).

Not completely straightforward, but that's the best I can think of.

bjaspan’s picture

That's basically what I was thinking; sorting by entity_id (and entity_type) and delta is the key.

Why do you think the first query needs to be without range? It seems like we can start with ->range(0, $limit), no?

yched’s picture

"It seems like we can start with ->range(0, $limit), no?"
Well, the subsequent while() iterations will need to adjust $query->range($offset, $limit) anyway.

bjaspan’s picture

Right. But if we start with an unlimited query, we are telling the server to do a potentially enormous amount of work and send a bunch of rows that will get buffered in local memory which, if we have a limit, we won't use.

So, we should start with an initial limit if one is provided.

I guess I will do this as part of the bulk delete patch, since that is what is forcing the $limit parameter. One could argue that this should be a separate issue, but whatever. :-)

yched’s picture

Sure, I'm not suggesting we run an unlimited query; rather: we *build* the query without range, then execute it several times in a while() loop that adds the relevant range each time.
Anyway, I'm pretty sure we mean the same thing :-)

bjaspan’s picture

Oh! Now I understand what you meant. I did not interpret your pseudo-code to mean what you said in #27. Yes, we are in agreement.

bjaspan’s picture

@yched: This is getting worse and worse.

To implement the query-looping idea we've been discussing, as you said in #23, we need to return the offset that the caller can use to get the next bunch of objects. However, I'm not really that confident we can return exactly the right offset. I see in your pseudo-code that you are breaking the inner loop as soon as we read the $limit+1st object, so presumably $offset will be the exact offset that will return the query to the correct place in the result set to start on the next object... but this seems like very fragile code/logic, and if anything changes in the data, we'll end up returning objects with incorrect data items, and that seems like a bad idea. You basically spelled out the problem with the $limit object vs. rows mismatch in #23.

So... please remind me why we absolutely must support FIELD_QUERY_RETURN_VALUES. The code really duplicates the logic in hook_field_storage_load(). I understand that if we do not support FIELD_QUERY_RETURN_VALUES, then callers will need to get the ids from f_a_query(), which performs the db query, and then call f_a_load() which performs a very similar query to retrieve very similar data, and of course that is not ideally efficient. However, we do have load_multiple, and I suspect most calls to f_a_query() will be for a fairly low $limit (e.g. one page of terms or whatever), so the multiple-load shouldn't be too painful.

The reason to consider removing FIELD_QUERY_RETURN_VALUES is that it makes the $limit parameter very problematic, and as we both now agree, $limit is really mandatory. I think it is BETTER to force two queries (one from f_a_query() and one from f_a_load()) for similar data than to basically require every call to f_a_query() to retrieve a potentially enormous amount of data it does not need, swamping memory and the database.
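
To make the mismatch concrete (hypothetical data, not Field API code): each row is one field value, so a row-based $limit can cut an object's set of values in half at a page boundary:

```python
# Each row is (entity_id, delta): one field value per row, several rows per object.
rows = [(1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (3, 0)]

def page_of_values(rows, offset, limit):
    """RETURN_VALUES-style paging: group a row-based page into per-object value lists."""
    page = {}
    for entity_id, delta in rows[offset:offset + limit]:
        page.setdefault(entity_id, []).append(delta)
    return page

first = page_of_values(rows, 0, 4)
# Object 2 has three values, but only two fall inside the first page of 4 rows,
# so the returned value set for object 2 is silently partial.
```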

yched’s picture

and if anything changes in the data, we'll end up returning objects with incorrect data items

You mean, if the calling code does data writes between f_a_query(some bunch) and f_a_query(next bunch)? Hm, yes, I guess that would mess up the $offset pretty badly...

please remind me why we absolutely must support FIELD_QUERY_RETURN_VALUES

Well - it all started in #392494-5: Fields API needs to allow searching of scalars, by yourself ;-)
I wouldn't die if we removed that and went for "Call f_a_query() to get ids, then load the objects if you need the values".
The only thing we lose, AFAICT: if I want to reason/act only on the field values that match the actual conditions, I need to filter them out of the loaded objects myself.

But: doesn't FIELD_QUERY_RETURN_IDS have the exact same $limit/$offset mismatch issues? The query it runs is still based on 1 row = 1 value != 1 object...
FIELD_QUERY_RETURN_VALUES just adds to the 'SELECT' part - unless I'm mistaken, this is not where our problem lies.

I'm not sure there's a way to work with $limit/$offset without persisting the 'query result' resource. I see this turning into a 'design a mini DBTNG API, for generic, non-PDO, non-SQL backends' :-(

/me feels a bit dizzy looking at the can of worms that is pluggable storage while we currently don't gain any actual benefits...

bjaspan’s picture

It's not just the code calling f_a_query() and then f_a_query() again that can write to the database; any other caller can write to the database as well.

Yes, RETURN_IDS has the same limit/offset mismatch problems that RETURN_VALUES does. The difference is that RETURN_IDS either returns an id or it does not, whereas RETURN_VALUES might return a partial set of values for an object. So the limit/offset mismatch for RETURN_IDS means that if someone else writes to the database while you are doing a pager thing, your pager might not be exactly accurate (you might miss some objects, or list them on adjacent pages, etc.), but that is always the case with pagers into dynamic databases. A limit/offset mismatch for RETURN_VALUES means a potentially invalid return set (e.g. if I query by entity type, bundle, or entity id for a field, I should get back ALL the values for that field for each object returned, not just some of them).

To summarize:

* A limit/offset parameter to f_a_query() is not reliably and efficiently implementable for RETURN_VALUES (or at least we do not know how yet).
* Calling f_a_query() without a limit/offset (even with RETURN_IDS) is potentially a suicidal operation; I suppose we can support it because sometimes the site owner knows it is safe.
* Therefore, my personal conclusion is that we should remove RETURN_VALUES from f_a_query() entirely. The alternative is to say, "passing $limit for RETURN_VALUES may return invalid results" which is just another way of saying, "using RETURN_VALUES is either wrong or suicidal." :-)

However, and I forgot this last night, I do not actually need to fight the RETURN_VALUES battle for bulk delete, I can simply use RETURN_IDS.
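
The asymmetry can be sketched with the same value-per-row model (hypothetical data and helper names): an id either appears on a page or it doesn't, so the worst case is a pager inaccuracy, never a partial object:

```python
# Each row is (entity_id, delta): one field value per row.
rows = [(1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (3, 0)]

def page_of_ids(rows, offset, limit):
    """RETURN_IDS-style paging: collapse a page of value rows into object ids."""
    ids = []
    for entity_id, _delta in rows[offset:offset + limit]:
        if entity_id not in ids:
            ids.append(entity_id)
    return ids

# Object 2 straddles the page boundary, so it shows up on both pages -
# a pager inaccuracy, but each id returned is still a complete answer.
```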

yched’s picture

Agreed on the conclusions. Off with RETURN_VALUES - doh, that's where 80% of the work on f_a_query() went :-)

The 'single field' mode for _field_invoke(_multiple) was primarily introduced for RETURN_VALUES and value massaging, but I don't think we need to remove it.
field_view_field(), for instance (shorthand API function to display a single field) doesn't currently use it, but technically it should.

bjaspan’s picture

Status: Needs work » Needs review
FileSize
59.1 KB

Okay, here's the latest patch. Setting CNR to see what the bot thinks of it, not because it is remotely done yet. At this point I think it might be possible that I actually understand how to implement this whole damn thing, but I am beginning to wonder if it is worth the complexity (I'd say "effort" but perhaps the effort has already been expended).

The big changes in this version are:

* f_a_query() no longer supports RETURN_VALUES and instead accepts $count and &$cursor arguments so callers can page through the objects matching the query. This will certainly break any code that currently calls f_a_query(), which the testbot will find. :-)
* f_a_load() takes an $options argument since it needs to allow loading deleted fields and single fields (so far these are only needed together, but they are treated as separate options).

The promised subtleties regarding field_name and field_id are cropping up. There are various places where data structures must be keyed by field id, and field_info_field_by_id() or field_read_instances() must be used to retrieve deleted field info, leading to somewhat inelegant code. We decided not to switch the whole API to use ids instead of names, so this is what we get.

We really took on a pretty complex problem when we decided to try to avoid orphaning data with bulk deletes (like the rest of Drupal does). I'm confident there are still gotchas we haven't thought of.

Status: Needs review » Needs work

The last submitted patch failed testing.

bjaspan’s picture

Status: Needs work » Needs review
FileSize
49.31 KB

Rerolled the patch now that #512236: Design flaws in field_attach_query() is in. My comments in #33 still apply: it now appears that the questions about how to possibly make this functionality work are resolved, and now there are just some decisions to make about how we want it to operate.

One open question: Who calls field_purge($batch_size) and when? My best guess so far is that hook_cron() calls it with a batch size determined by an admin form, just like search indexing. Or maybe we just always call it on cron with a batch size of 10 or 100 or something. Or we do that but conditionalized on a variable so a contrib module can do something more flexible.
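
The cron-driven idea might look roughly like this (a Python stand-in for a hook_cron() that drains marked-as-deleted data in configurable batches; the queue and variable names are made up for illustration):

```python
deleted_queue = list(range(25))   # rows marked deleted, awaiting purge
purge_batch_size = 10             # stand-in for a 'field_purge_batch_size' variable

def field_purge(batch_size):
    """Purge up to batch_size marked-as-deleted items; stand-in for the real purge."""
    del deleted_queue[:batch_size]

def cron():
    # Each cron run does a bounded amount of purge work.
    field_purge(purge_batch_size)

for _ in range(3):                # three cron runs drain 25 items at 10 per run
    cron()
```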

yched’s picture

No time to review tonight.

Firing purge on cron with a variable-based size sounds fine IMO. We could add a UI for the variable, but there's currently no other 'Field API setting', and I'm not sure this warrants one...

I guess field_ui could add a dsm('Field foo was marked as deleted. Data will be purged progressively. Click [[here|link_to_batch_page]] to purge right now'). Doesn't sound too difficult, but that definitely doesn't belong in this patch.

yched’s picture

Awesome work. Remarks below are mainly details.

- _field_invoke(), _field_invoke_multiple():
$options['deleted'] is 0/1 based, while $options['default'] is TRUE/FALSE based. It's really a boolean flag, isn't it?
$options['field_name'] is changed to $options['field_id'] in _field_invoke_multiple(), but left untouched in _field_invoke() (though field_purge_data() uses _field_invoke($options['field_id'])). Ideally, I'd say both functions should support both options?

- _field_invoke_multiple():
There are a couple places where $instance['field_id'] could be replaced by the newly added $field_id variable.

- I'm wondering if the $field_id param for field_attach_query() couldn't accept a field_id *or* a field_name. Would just require we don't allow field names to be all numeric. Seems reasonable to me, we already limit to [a-z0-9_].

- field_sql_storage_field_storage_load()
foreach ($instances as $field_name => $instance) {: field_read_instances() results are just numerically indexed arrays, so we cannot assume $field_name keys.

- field_sql_storage_field_storage_query()

+    // TODO: This probably does not match an index.
+    // TODO: I do not think we need to sort at all any more.
+    //->orderBy(array('etid', 'entity_id', 'delta'))

We still need to sort on etid, entity_id, since the code assumes the rows for a given object arrive in sequence.
Currently the data tables have a unique index on array('etid', 'entity_id', 'deleted', 'delta'), so this should be good enough, right?
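
The assumption can be illustrated like this (hypothetical rows): with the result sorted on etid and entity_id, a single pass can finish one object before the next one starts:

```python
# (etid, entity_id, delta) rows, already sorted on (etid, entity_id).
rows = [(1, 10, 0), (1, 10, 1), (1, 11, 0), (2, 10, 0)]

def group_sorted_rows(rows):
    """Collect deltas per object, relying on sorted input to detect object boundaries."""
    objects = []
    current_key, current = None, None
    for etid, entity_id, delta in rows:
        key = (etid, entity_id)
        if key != current_key:
            current = []
            objects.append((key, current))
            current_key = key
        current.append(delta)
    return objects
```

Without the sort, rows for one object could be interleaved with another's and this single-pass grouping would produce duplicate entries per object.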

+      
+      $return[$row->type][$id] = field_attach_create_stub_object($row->type, array($row->entity_id, $row->revision_id, $row->bundle));
+      --$count;

Looks like a wrong reroll after #512236: Design flaws in field_attach_query(), this hunk is not needed anymore (code is already present above).

- field_sql_storage_field_storage_delete() relies on $instances = field_info_instances($bundle);, and thus purges data only for non-deleted instances.
So if an instance is marked as deleted, and just after that, a node is deleted, field_attach_delete($node) will leave the data for the marked-as-deleted instance - but they will be caught by the next purge, right ? (Just to be sure)

- field_purge():

   // TODO: Perhaps the include_deleted argument to this
  // function should be replaced by inspecting the query parameters.

I don't think I get this.

- field_purge_instance() and field_purge_field() should call _field_info_clear_cache() instead of field_cache_clear() ?

- _field_info_clear_cache() should be ..._cache_clear(). Also, not sure if we want the function private or public.

- Still trying to sort out TODO: refactor field_purge_data() with field_attach_delete()? :-)

- shouldn't field_sql_storage.module implement hook_field_storage_purge_field()?

- field.test, FieldBulkDeleteTestCase::getInfo()

      'name'  => t('Field bulk delete tests'),
      'description'  => t("Bulk delete fields and instances, and clean up afterwards."),

Core doesn't translate those strings anymore, plus there's double spacing before the => (those were fixed in the rest of field.test in the meantime)

- field.test: missing PHPdoc for _generateStubObjects() and testDeleteFieldInstance()

- field.test, setUp():
Minor: field_test_create_bundle() doesn't actually need the $desc param.
Minor: why 'bb' and 'bf' for bundle and field names, instead of just 'b' and 'f'?

- field.test: do we need the tearDown() at all? I thought the target db was reinitialized after each test method?
It seems we could live with no tearDown() and $this->fields = array(); $this->instances = array(); moved to the beginning of setUp()?

- field.test: A few one-line comments are missing a trailing period. Multiline comments could wrap at a longer line length. A comment in testPurgeInstance() has double spacing.

- I created #523842: introduce FieldWebTestClass wrapper class to avoid redefining _generateTestFieldValues() in each test class. Should not hold this patch back, obviously.

- Not for this patch, but it seems we're missing a hook_field_storage_pre_delete()? How does PBS react to object deletion? (would probably be best in a separate thread, unless you have a no-brainer 'no need' answer...)

- Probably not for this patch either, but 'fieldable comments' might raise the question of 'immediate delete vs bulk delete' for single object deletion too :-/
A comment being deleted triggers deletion of its whole subthread...

bjaspan’s picture

Excellent review, thanks. You found a number of real issues. I'm sure there is more to do, but it is good to get this much solid. I've fixed everything from #37 except as follows.

- I'm wondering if the $field_id param for field_attach_query() couldn't accept a field_id *or* a field_name.

Ugh. I really don't like that. It is really not that hard to use field ids.

- field_purge():
   // TODO: Perhaps the include_deleted argument to this
  // function should be replaced by inspecting the query parameters.

I don't think I get this.

f_a_query() takes a list of query parameters including 'deleted'. field_read_instances() takes a list of query parameters, with a separate list of parameters for 'deleted' and 'inactive'. I'm thinking that 'deleted' at least should just be a normal query parameter; for consistency, it should default to FALSE. Thoughts?

Minor: why 'bb' and 'bf' for bundles and fields names, instead of just 'b' and 'f' ?

Don't remember. :-) "Bulk bundle" and "bulk field", probably.

- Not for this patch, but it seems we're missing a hook_field_storage_pre_delete()? How does PBS react to object deletion?

I haven't touched pbs.module in a while. However, the reason there is no pre_delete() is that I did not think it could matter to prevent the field storage engine from deleting whatever data it has; at worst, it allows a delete query that doesn't have anything to delete. So, pbs.module can just implement hook_f_a_delete() instead.

Probably we should have pre_delete just for consistency. Separate issue, I think.

yched’s picture

I gave this another careful reading:

- Fixed a couple remaining double spaces

- Unified the PHPdocs for field_id/field_name options in _field_invoke()/_field_invoke_multiple().

- About // TODO: Perhaps the include_deleted argument to field_read_instances() should be replaced by inspecting the query parameters. and

f_a_query() takes a list of query parameters including 'deleted'. field_read_instances() takes a list of query parameters, with a separate list of parameters for 'deleted' and 'inactive'. I'm thinking that 'deleted' at least should just be a normal query parameter; for consistency, it should default to FALSE.

The difference is that a condition on 'deleted' in f_a_query() restricts results to either 'deleted only' or 'non-deleted only', while field_read_instances() does either 'non-deleted only' or 'both deleted and non-deleted'. Which IMO fits the usual use cases of both functions, so the 'inconsistency' is OK by me.
Of course, this means that if you want to get 'only deleted instances', you need a somewhat verbose field_read_instances(array('deleted' => 1), array('include_deleted' => 1)); - but apart from field_purge(), I see no real use case for this.
I removed the related TODOs in field_purge().

- Updated the PHPdoc for _field_info_collate_fields(), and reworked the function to trigger a single field_read_fields() call, and not store non-deleted fields twice in the cache.

- Which outlines: _field_info_collate_fields() collates all fields (including deleted) in 'field_ids', but only collects non-deleted instances.
Which translates to: we provide field_info_field_by_id() but not field_info_instance_by_id().
If we did, then field_purge() could use field_info_fields_by_id() and field_info_instances_by_id() instead of field_read_fields() / field_read_instances(), and get rid of the "We cannot use function X because it does Y" comments and the dance between field_read_*() and field_info_*() in field_purge().
I have no strong bias on this, just a suggestion.

Other than this last suggestion, this looks ready to me :-)

The more I think about it, the more I'm convinced that single object deletion should be bulked as well.
- It would address the questions of 'merge field_purge_data() and field_attach_delete()'.
- It would simplify the explanations in @defgroup field_purge quite a lot :-). One single model for data deletion.
- 'Fieldable comments' bring the most obvious cases where a single-object delete can cascade to arbitrary amounts of data deletion (node delete -> all attached comments should be deleted; comment deleted -> the comment subthread should be deleted). But 'cascade deletion through nodereference fields' is a long-standing D6 request (optional, obviously), that would also imply such scenarios, I'm sure we can think of others...
As I said earlier, I'm perfectly fine with keeping this for a followup, though.

bjaspan’s picture

I continue to be impressed by your amazing time and work put into Field API. Thanks! :-)

This followup replies to your comments in #39; my next one will (attempt to) be a from-the-top review of the patch in its current state.

1. I realize the semantic difference between the deleted options for f_a_query() and field_read_instances(). What I am suggesting is that we change f_read_instances() to treat 'deleted' as nothing special. If you don't query on it, you get both deleted and non-deleted (what you currently get with 'include_deleted'). If you do query on it, you get only what you ask for.

I think this makes sense because there is exactly *one* place that calls f_read_instances() without passing an include_* value in the second argument, namely _field_info_collate_fields(), and many places that specify the include_* values. It would be very simple for collate_fields() to pass array('deleted' => 0), and it would make all the other calling code cleaner.

In fact, if we actually wanted to implement field_info_instance(s)_by_id(), even collate_fields() would not want to restrict to non-deleted instances; at that point, NO caller would want to restrict, so EVERY caller would pass the "optional" include_* argument, making it pretty clear the include_* arg should disappear, at least for 'deleted'.

2. As you know, I want to convert the whole Field API to using ids instead of names, but since the decision was made not to, I tried to make as few id-related API changes as possible. I added f_i_field_by_id() because it made the code a lot cleaner.

I did not add f_i_instance_by_id() because there is no use case for it. That function would logically return a single instance, by id. field_purge() needs to retrieve *all* deleted instances. We could add f_i_instance*S*_by_id(), which would return all instances keyed by id, but that is confusingly different from f_i_field_by_id(), which returns only a single field.

I did notice that field_purge() was using field_read_fields() where it could use f_i_field_by_id(), so I simplified that.

I have no particular objection to making single-object delete be delayed; I agree that one approach is better than two. I also agree that should be a separate followup.

bjaspan’s picture

Status: Needs review » Needs work

This is an attempt to review this patch as if I did not write it; I'll try to follow up with a revised patch addressing the issues I found soon. I must say I feel like this bulk delete thing will never end! I seriously wonder if we should have attempted to address it at all; perhaps just orphaning data isn't such a terrible thing.

Anyway, here we go:

* This pattern occurs more than once:

  if ($options['deleted']) {
    $instances = field_read_instances(array('bundle' => $bundle), array('include_deleted' => $options['deleted']));
  }
  else {
    $instances = field_info_instances($bundle);
  }

The code would be cleaner if a single API function could handle both cases.
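
What such a single API function might look like, sketched with a hypothetical helper (not the actual Field Info API):

```python
# Hypothetical instance records keyed by the properties used in the pattern above.
instances = [
    {'bundle': 'article', 'field': 'body', 'deleted': False},
    {'bundle': 'article', 'field': 'old_field', 'deleted': True},
]

def instances_for_bundle(bundle, include_deleted=False):
    """One entry point for both cases: deleted rows only show up when asked for."""
    return [i for i in instances
            if i['bundle'] == bundle and (include_deleted or not i['deleted'])]
```

Callers then pass a single flag instead of branching between two different API functions.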

* _field_invoke_multiple() groups by $field_id but uses $object->$field_name. A comment explaining why would be helpful.

* This looks like a bug, should use field_id. This indicates that there are no tests for refusing to cache when reading a single field.

  // In addition, do not write to the cache when loading a single field.
  $cache_write = $cache_read && !isset($options['field_name']);

* Now that field_attach_load() takes an $options argument, shouldn't $age be one of them instead of a separate argument? Followup issue.

* Sometimes a 'deleted' option means "deleted status must match this", other times 'deleted' of TRUE means "deleted and non-deleted status is allowed." We should use "include_deleted" for the latter.

* field_purge() php doc: "Field" should be plural in the first sentence. Also, the docs talk about numbers of field data items, except purge now operates in terms of objects because that is what f_a_query() returns.

* field_purge(): Use field_info_field_by_id() instead of read_instances().

* field_purge(): Passes deleted as 1, not TRUE. This probably occurs elsewhere, too.

* field_purge_field()'s phpdoc says it assumes all instances have been deleted, but then it checks and throws an exception if they haven't been. I wonder if there are other places we should check and throw exceptions in this patch, though none occur to me this night at 12:45am.

* _field_info_collate_fields(): Patch removes documentation of the 'bundles' element.

* There is no field_cron() to call field_purge().

* There is no test that deleted field records are purged when there are no more instances, nor that a non-deleted field is not purged after a deleted instance is.

* skip_fields is still defined in terms of field name. If _field_invoke/multiple is being invoked on deleted fields, it is entirely possible that there will be multiple deleted fields with the same name. Therefore, skip_fields must be keyed by id in that case, and should be keyed by id in all cases for consistency.

yched’s picture

I see your point on dropping 'include_deleted' in field_read_[fields|instances]() in favor of a 'deleted = TRUE|FALSE' in $param, and I agree.
This is actually in line with my intent in #535034: Clean-up how fields and instances are prepared for runtime. to make field_info_*() the primary provider for $field and $instances definitions.
If _field_info_collate_fields() collects both deleted and non-deleted fields/instances, there should be very few use cases for field_read_*().

Then it's "just" a question of what is the right set of helper _info functions to get $fields and $instances out of _field_info_collate_fields():
fields, instances, by name, by id, deleted + non-deleted, only non-deleted.
- 'deleted + non-deleted' vs 'only non-deleted' sounds like an 'include_deleted' param to me ;-).
- I'll make a second attempt with my suggestion to have the $name arguments for field_info_fields() and field_info_instances() accept ids too... it would get rid of the _by_id variants...

That would be:
field_info_fields($include_deleted = FALSE);
field_info_instances($bundle, $include_deleted = FALSE);
field_info_field($name_or_id);
field_info_instance($name_or_id);

bjaspan’s picture

Status: Needs work » Needs review
FileSize
60.31 KB

Okay, new patch!

# Fixed issues from my review in #39

* _field_invoke_multiple() groups by $field_id but uses $object->$field_name. A comment explaining why would be helpful.

* This looks like a bug, should use field_id. This indicates that there are no tests for refusing to cache when reading a single field.

  // In addition, do not write to the cache when loading a single field.
  $cache_write = $cache_read && !isset($options['field_name']);

* field_purge() php doc: "Field" should be plural in the first sentence. Also, the docs talk about numbers of field data items, except purge now operates in terms of objects because that is what f_a_query() returns.

* field_purge(): Use field_info_field_by_id() instead of read_instances().

* _field_info_collate_fields(): Patch removes documentation of the 'bundles' element.

* There is no field_cron() to call field_purge().

* There is no test that deleted field records are purged when there are no more instances, nor that a non-deleted field is not purged after a deleted instance is.

# Topics to handle as separate issues

Clean up field_read_instances() vs. field_info_instances() for deleted instances, and generally simplify the Field Info API for deleted things.

Move field_attach_load()'s $age argument into $options.

I re-defined skip_fields in terms of field id, but skip_fields is actually not set anywhere in core and there are no tests for it. So, "test skip_fields" is a separate issue.

# Unresolved (?) issues

We need to clean up the semantics of 'deleted' options throughout the API. Sometimes a 'deleted' option means "deleted status must match this", other times 'deleted' of TRUE means "deleted and non-deleted status is allowed." We should use "include_deleted" for the latter. Sometimes deleted is an int and sometimes a boolean; this might be right (b/c sometimes it is a db query parameter and sometimes a functional request), but should be considered.
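
The two semantics being conflated can be pinned down side by side (a hypothetical illustration, not Field API code):

```python
def filter_match(items, deleted):
    """'deleted' as a match condition: the status must equal the flag."""
    return [i for i in items if i['deleted'] == deleted]

def filter_include(items, include_deleted=False):
    """'include_deleted' as a widening flag: FALSE hides deleted, TRUE hides nothing."""
    return [i for i in items if include_deleted or not i['deleted']]

items = [{'id': 1, 'deleted': False}, {'id': 2, 'deleted': True}]
# filter_match(items, True)  -> only the deleted item
# filter_include(items)      -> only the non-deleted item
# filter_include(items, include_deleted=True) -> both
```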

I'd like to see all the above be a separate issue so we can put a stake in the ground for this never-ending issue (only 40-ish comments here but HOURS of conversation offline). I could be convinced this belongs in the current issue if someone argues hard enough, but it will certainly delay the patch, likely cause more patch conflicts, and increase the risk of it not getting done.

field_purge_field()'s phpdoc says it assumes all instances have been deleted, but then it checks and throws an exception if they haven't been. I can't get concerned about an extra validation check, so I'm defining this as a non-issue.

For the first time, I think this patch is committable, though I won't be surprised if additional reviews turn up more issues.

Status: Needs review » Needs work

The last submitted patch failed testing.

bjaspan’s picture

Status: Needs work » Needs review
FileSize
61.12 KB

Oops. I forgot that this is the patch that changes f_a_query() to take a field id, not a name, so I also have to update callers of f_a_query() (i.e. taxonomy.module).

yched’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
58.69 KB

Looks just fine to me. Yay!
Updated patch just fixes 80-char line wraps in comments and a couple of double spaces, and moves field_cron() up with the other core hooks toward the top of field.module.

Minor nitpick: I think we could be a bit more ambitious on the default value for the 'field_purge_batch_size' variable. Maybe 50 or 100 instead of 10 ?
OTOH, it's not completely clear to me what tasks filefield will actually perform when purging a value...

Well, I don't think this should hold the patch anyway. This is RTBC.

yched’s picture

Found a few other comments to fix in field.test.

+ FieldBulkDeleteTestCase::setUp():

    // TODO: Create a second field that we do not delete so we can
    // test that its data stays around.

It *looks* like this is actually done, but I'll let you confirm.

bjaspan’s picture

You're right, that TODO is done, new patch just removes the comment.

sun’s picture

A rather quick, coding-standards only review:

+++ modules/field/field.attach.inc	6 Aug 2009 04:25:26 -0000
@@ -152,6 +152,11 @@ class FieldQueryException extends FieldE
+ *  - 'field_id'
+ *    The id of the field whose operation should be invoked. By default, the
+ *    operation is invoked on all the fields in the objects' bundles.
@@ -161,22 +166,35 @@ class FieldQueryException extends FieldE
+ *  - 'deleted'
+ *    If TRUE, the function will operate on deleted fields as well as
+ *    non-deleted fields. If unset or FALSE, only non-deleted fields are
+ *    operated on.

Yes. The other code gets it wrong. But we actually do:

 *  - 'field_id': The id of the field whose operation should be invoked. By
 *    default, the operation is invoked on all fields in the objects' bundles.

There is no need and no point in introducing wrong code just to adhere to already-existing, equally wrong code.

(and elsewhere)

+++ modules/field/field.attach.inc	6 Aug 2009 04:25:26 -0000
@@ -161,22 +166,35 @@ class FieldQueryException extends FieldE
-    if (empty($options['field_name']) || $options['field_name'] == $field_name) {
+    if ((empty($options['field_id']) || $options['field_id'] == $instance['field_id']) && (empty($options['field_name']) || $options['field_name'] == $field_name)) {
@@ -259,22 +287,36 @@ function _field_invoke_multiple($op, $ob
-      if (empty($options['field_name']) || $options['field_name'] == $field_name) {
+      if ((empty($options['field_id']) || $options['field_id'] == $field_id) && (empty($options['field_name']) || $options['field_name'] == $field_name)) {

empty() should be replaced with !isset() here to

a) save some cycles

b) make the actual condition more understandable

c) make any sense of the actual condition.

+++ modules/field/field.attach.inc	6 Aug 2009 04:25:26 -0000
@@ -259,22 +287,36 @@ function _field_invoke_multiple($op, $ob
+  // contain data for a single field for each name, even if that field is deleted,

Minor: This line exceeds 80 chars.

+++ modules/field/field.crud.inc	6 Aug 2009 04:25:26 -0000
@@ -681,3 +684,218 @@ function field_delete_instance($field_na
+ *   * Invoking the Field Type API hook_field_delete() for each field on the
+ * object. The hook for each field type receives the object and the specific
+ * field being deleted. A file field module might use this hook to delete
+ * uploaded files from the filesystem.
+ *   * Invoking the Field Storage API hook_field_storage_delete() to remove
+ * data from the primary field storage. The hook implementation receives the
+ * object being deleted and deletes data for all of the object's bundle's
+ * fields.
+ *   * Invoking the global Field Attach API hook_field_attach_delete() for all
+ * modules that implement it. Each hook implementation receives the object
+ * being deleted and can operate on whichever subset of the object's bundle's
+ * fields it chooses to.

Lists in PHPDoc, we usually format as:

 * The list:
 * - One.
 * - Two is a list item that is much longer than eighty chars and must thus be
 *   wrapped onto the next line.
 * - Three.
 
+++ modules/field/field.module	6 Aug 2009 04:25:26 -0000
@@ -178,6 +182,16 @@ function field_theme() {
+ * Purges some deleted Field API data, if any exists.

s/some//

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+   * Given an array of potentially fully-populated objects and an
+   * optional field name, generate an array of stub objects of the
+   * same fieldable type which contains the data for the field name
+   * (if given).
...
+   * Verify that deleting an instance leaves the field data items in
+   * the database and that the appropriate Field API functions can
+   * operate on the deleted data and instance.
+   *

PHPDoc summary missing - contains a description.

22 days to code freeze. Better review yourself.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Documentation in here is excellent; great job capturing the original thought behind the issue. It's kind of crazy that we jump through these kinds of hoops for fields, when we don't jump through them for anything else: nodes, files, etc. But I guess you have to start somewhere. :P

+++ modules/field/field.crud.inc	6 Aug 2009 04:25:26 -0000
@@ -681,3 +684,218 @@ function field_delete_instance($field_na
+function field_purge($batch_size) {

I got turned around in the tests trying to figure this function out, especially from these lines:

+      // Purge the data.
+      field_purge(10);
+
+      // Purge again to purge the instance.
+      field_purge(0);

At first I thought 10 was the field ID, but then I had no idea why 0 was passed in the second time. I had to go looking through the docs to discover that it's a *batch* purge operation, which I never would've guessed.

I thought this was named field_purge for symmetry with other field API functions, but at http://api.drupal.org/api/file/modules/field/field.crud.inc/7 I don't see any other two-word functions. Could we rename this function to field_batch_purge() or something similar that is more indicative of what it does?

Also, should $batch_size default to 0? That seems to be the most common value.

One other thing: I applied this patch and then installed CCK and it still didn't make fields delete. I guess this is because the purge stuff was missing. But it makes it really hard to "test" this functionality without a UI around it to see how it affects users. I guess the important thing is to get the API change in now so we can tweak that stuff after code freeze.

The rest of these comments are minor, and sun may have already caught in his review above (I haven't read it yet cos I was trying to come into this fresh). Fix those up, and I think this is good to go from my perspective.

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+class FieldBulkDeleteTestCase extends DrupalWebTestCase {

Line of summary PHPDoc please.

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+    $this->bundles = array('bb_1' => 'bb_1', 'bb_2' => 'bb_2');

drupal_map_assoc() might be slightly cleaner, but since there are only two here it's probably ok.

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+   * but depends on this class' setUp().

class's

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+  function testPurgeInstance() {
...
+  function testPurgeField() {

A line of summary PHPDoc needs to be added to these testX functions.

+++ modules/field/field.test	6 Aug 2009 04:25:26 -0000
@@ -1755,3 +1762,232 @@ class FieldInstanceCrudTestCase extends 
+      $this->assertEqual($stubs[$obj->ftid], $obj, "hook_field_delete() called with the correct stub {$obj->ftid}");

Single quotes around the assertion.

+++ modules/field/modules/field_sql_storage/field_sql_storage.module	6 Aug 2009 04:25:26 -0000
@@ -337,39 +350,54 @@ function field_sql_storage_field_storage
+    // We need to require objects arrive in a consistent order for
+    // range() operation to work.

Probably 'ensure that' objects arrive?
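For context on why the ordering matters: the purge walks the data table in fixed-size pages, and SQL guarantees no row order without an explicit ORDER BY, so successive range() windows could skip or re-read entities. A minimal sketch of the pattern, using the D7 database API (the table name and column names here are illustrative, not necessarily exactly what the patch uses):

```php
// Page through deleted rows in a stable order so that successive
// range() windows never skip or repeat entities.
$results = db_select('field_data_' . $field['field_name'], 't')
  ->fields('t')
  ->condition('deleted', 1)
  // Without this ORDER BY, each query may return rows in a
  // different order and the paging breaks.
  ->orderBy('entity_id', 'ASC')
  ->range(0, $batch_size)
  ->execute();
```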

+++ modules/simpletest/tests/field_test.module	6 Aug 2009 04:25:26 -0000
@@ -640,4 +640,28 @@ function field_test_memorize($key = NULL
+/**
+ * Memorize calls to hook_field_delete().
+ */
+function field_test_field_delete($obj_type, $object, $field, $instance, $items) {
+  $args = func_get_args();
+  field_test_memorize(__FUNCTION__, $args);
+}

I know these weren't introduced in this patch, but could you remind me what this is again?
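(For anyone else wondering: the memorize helpers are test plumbing - each hook implementation records its arguments so a test can later assert exactly which hooks fired, and with what. The store works roughly like the sketch below; the real implementation lives in field_test.module and may differ in detail:

```php
// Accumulate values in a static store keyed by caller; calling with
// no arguments drains the store so a test can inspect it.
function field_test_memorize($key = NULL, $value = NULL) {
  static $memorize = array();
  if (!isset($key)) {
    $return = $memorize;
    $memorize = array();
    return $return;
  }
  $memorize[$key][] = $value;
}
```
)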

This review is powered by Dreditor.

bjaspan’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
63.08 KB

New patch incorporating all changes from the previous two comments, except:

+++ modules/field/field.module 6 Aug 2009 04:25:26 -0000
@@ -178,6 +182,16 @@ function field_theme() {
+ * Purges some deleted Field API data, if any exists.

s/some//

No, "some" is correct there: a single call does not purge all of the deleted data, only up to $batch_size objects of it.

Also, should $batch_size default to 0? that seems to be the most common value.

No. 0 may be the common value in the tests, but that is an unusual case. The argument is the number of objects that should be purged during a normal cron run. There is no correct default value other than what comes from variable_get('field_purge_batch_size') or whatever we end up replacing field_cron() with in a subsequent patch.
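To make the intended wiring concrete, here is a sketch of the cron hook described above (assuming the field_purge_batch_size variable mentioned in the comment and a default of 10; the eventual field_cron() may look different):

```php
/**
 * Implements hook_cron().
 *
 * Purges a limited number of deleted field data objects on each cron
 * run, so that bulk deletes never blow past the PHP execution limit.
 */
function field_cron() {
  // With a batch size of 0, no data is purged, but fields and
  // instances whose data is already gone are still cleaned up.
  field_purge(variable_get('field_purge_batch_size', 10));
}
```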

One other thing: I applied this patch and then installed CCK and it still didn't make fields delete. I guess this is because the purge stuff was missing. But it makes it really hard to "test" this functionality without a UI around it to see how it affects users.

I don't understand this comment. Granted, I haven't tried the CCK UI recently... are you saying there is no delete functionality there at all? If so, that's a flaw in the UI code. All the changes in this patch should be completely invisible to users, unless the users inspect their database directly, because deleted fields, instances, and data already work before this patch; this patch just arranges for the deleted stuff to actually be cleaned up from the database.

It's kind of crazy that we jump through these kinds of hoops for fields, when we don't jump through them for anything else: nodes, files, etc. But I guess you have to start somewhere. :P

Yes, indeed. We tried to do several things with Field API that Drupal needs to do better everywhere, hoping to set the example for the future. Properly handling deleted data is one example, one that happened to turn into a real PITA. :-)

There are no real functional changes in this version, so I'm setting it back to RTBC. Hopefully the bot won't prove me wrong.

yched’s picture

Minor: the name field_purge_batch() makes it sound like a Batch API callback :-/. Same old problem with APIs / modules stealing an existing, current-language word instead of coining their own lingo...
I don't see many use cases for calling that function directly, except here in hook_cron() and possibly in Field UI for an actual 'Batch API' feature ("purge all data *now*" after a field / instance has been deleted in the UI), so I'm not sure the function name matters that much to the outside world; internally, though, the confusion with Batch API is technically unfortunate IMO.
field_purge() was kind of OK with me :-). Maybe field_purge_deleted() if we don't want two-word functions?

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Excellent work -- yched and bjaspan continue to impress. Committed to CVS.

rfay’s picture

Status: Fixed » Needs work

I wanted to report a potential issue with this commit. My existing D7 install from before this commit (no updates needed) broke:

Visiting /user I get:
Fatal error: Unsupported operand types in /home/rfay/workspace/d7git/modules/field/field.info.inc on line 264

I tried to debug it fully but ran out of time - I apologize. I did find that in field.info.inc, in _field_info_collate_fields() starting at line 216:

    foreach ($definitions['instances'] as $instance) {
      $field = $info['fields'][$instance['field_name']];
      $instance = _field_info_prepare_instance($instance, $field);

$field is null, and that's the issue. I don't understand the data structures well enough to understand *why* it ended up null.

I do have the database that demonstrates the problem and will provide it if you're interested.

I do *not* see the issue if I do a clean install with today's D7.

bjaspan’s picture

Status: Needs work » Fixed

@rfay: We do not guarantee that HEAD continues to work with existing databases across commits, even when no update functions are needed; in fact, we have a policy against writing update functions for such cases even where they are possible. Thus, what you are reporting is not a bug. Sorry, you have to wipe your db and start over.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.