Due to popular demand, I've documented how this can work.
Technically most of the functionality was there under the hood already, but now it's been tweaked to recognise CCK fields explicitly.
The base functionality places found content into the $node->body
field. It does not naturally target arbitrary CCK fields, but that is
also possible.
If you have a CCK node type with (e.g.) the fields:
field_text, field_byline, field_image
and your input pages are nice and semantically tagged, e.g.
<body>
<h1 id='title'>the title</h1>
<div id='image'><img src='this.gif'/></div>
<h3 id='byline'>By me</h3>
<div id='text'>the content html etc</div>
</body>
HTML ids are mapped to CCK fields automatically, and the content
should just fall into place:
$node->title = "the title";
$node->field_image = "<img src='this.gif'/>";
$node->field_byline = "By me";
$node->field_text = "the content html etc";
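To make the idea concrete, here is a minimal sketch of how elements carrying an id could be pulled out of a page and keyed by that id. It is not the module's own code: the function name sketch_collect_by_id() is made up for illustration, and it uses only PHP's standard DOM extension (PHP 5.3.6 or later, for saveHTML() with a node argument).

```php
<?php
// Hypothetical helper, NOT part of the module: collect the inner HTML of
// every element that has an id attribute, keyed by that id.
function sketch_collect_by_id($html) {
  // Keep libxml quiet about loose real-world markup.
  libxml_use_internal_errors(true);
  $doc = new DOMDocument();
  $doc->loadHTML($html);
  libxml_clear_errors();

  $xpath = new DOMXPath($doc);
  $found = array();
  foreach ($xpath->query('//*[@id]') as $element) {
    // Serialize the children only, so the wrapper tag itself is dropped.
    $inner = '';
    foreach ($element->childNodes as $child) {
      $inner .= $doc->saveHTML($child);
    }
    $found[$element->getAttribute('id')] = trim($inner);
  }
  return $found;
}

$html = "<html><body>"
  . "<h1 id='title'>the title</h1>"
  . "<div id='image'><img src='this.gif'/></div>"
  . "<h3 id='byline'>By me</h3>"
  . "<div id='text'>the content html etc</div>"
  . "</body></html>";

$found = sketch_collect_by_id($html);
```

With the sample page above, $found ends up with the keys title, image, byline and text. The image markup is re-serialized by the DOM library, so its quoting may differ from the source.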
In fact, ANY element found in the source text with an ID or class
gets added to the $node object during import, although most data
found this way is immediately discarded again if the content type
doesn't know how to serialize it.
The special case demonstrated here prepends field_ to the names of
known CCK fields. Normally, found elements get labelled as-is.
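The naming rule can be sketched roughly as follows. This is an illustration under assumptions, not the module's implementation: sketch_apply_to_node() and its $known_cck_fields parameter are hypothetical, and values are assigned as plain strings to mirror the simplified example earlier in this document.

```php
<?php
// Hypothetical sketch, NOT the module's code: copy found values onto a
// node object, prefixing "field_" only when that yields a known CCK field.
function sketch_apply_to_node($node, array $found, array $known_cck_fields) {
  foreach ($found as $key => $value) {
    $prefixed = 'field_' . $key;
    if (in_array($prefixed, $known_cck_fields)) {
      // Special case: the id matches a known CCK field once prefixed.
      $node->$prefixed = $value;
    }
    else {
      // Normal case: the value is labelled as-is.
      $node->$key = $value;
    }
  }
  return $node;
}

$node = sketch_apply_to_node(
  new stdClass(),
  array('title' => 'the title', 'byline' => 'By me', 'sidebar' => 'misc'),
  array('field_text', 'field_byline', 'field_image')
);
```

Here byline lands in $node->field_byline, while title and sidebar keep their own names. A value like sidebar would stay on the object during import and simply not be saved if the content type has nowhere to store it.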