Problem/Motivation
I'm just spitballing here, but this is something that might prove very useful. I have been researching how to do OpenAI embeddings. Since all the content in a Drupal site is in the database, it seems like we could create a vector table that holds the vectors for each page. Then, using Views, we could get the vector for the search input and run a vector search, using cosine similarity or dot product, to determine the results.
Steps to reproduce
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
1. Create a vector table. Two fields: entity_id and vectors
2. Run an embedding against the node/paragraph content fields. Store the vectors in the vector table with associated entity_id.
3. Create a custom Views filter that will:
a. retrieve the vectors for the search term
b. run a mathematical computation (cosine similarity, dot product, etc.) of the search vector against the vector table (see the sketch after this list)
c. return a text excerpt for the highest result(s) with a link to the entity
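For illustration, the computation in step 3b could be as simple as the following PHP sketch (the function name is hypothetical; nothing here is committed code):

```php
<?php

/**
 * Computes the cosine similarity of two equal-length vectors.
 *
 * Sketch only: returns 1.0 for identical directions, 0.0 for orthogonal
 * vectors, and -1.0 for opposite directions.
 */
function openai_embeddings_cosine(array $a, array $b): float {
  $dot = 0.0;
  $norm_a = 0.0;
  $norm_b = 0.0;
  foreach ($a as $i => $value) {
    $dot += $value * $b[$i];
    $norm_a += $value * $value;
    $norm_b += $b[$i] * $b[$i];
  }
  return $dot / (sqrt($norm_a) * sqrt($norm_b));
}
```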
Proposed resolution
A submodule of this OpenAI / GPT module which will create and maintain the vector table, as well as programmatically create the custom Views filter for doing embedding-based searches.
Remaining tasks
Creating and maintaining the vector table should be relatively simple. It could be done during cron and, of course, during node add/update/delete operations.
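As a sketch of the add/update path, assuming a hypothetical openai_embeddings submodule and queue (names are illustrative only):

```php
<?php

use Drupal\Core\Entity\EntityInterface;

/**
 * Implements hook_entity_update().
 *
 * Queues changed content for re-embedding; hook_entity_insert() and
 * hook_entity_delete() would be handled the same way.
 */
function openai_embeddings_entity_update(EntityInterface $entity) {
  if (!in_array($entity->getEntityTypeId(), ['node', 'paragraph'], TRUE)) {
    return;
  }
  \Drupal::queue('openai_embeddings')->createItem([
    'entity_type' => $entity->getEntityTypeId(),
    'entity_id' => $entity->id(),
  ]);
}
```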
The big problem will be determining the mathematical function for the vector search, and how it maps onto all of the possible filter conditions: contains, does not contain, equals, less than, etc.
User interface changes
I don't see any user interface changes. All of the OpenAI calls would be invisible to the end user.
API changes
Clearly we would be adding OpenAI API embedding calls: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
Data model changes
The only real data model change is that we will do searches on vectors instead of keywords.
I'd like to figure out a way to do this with Search API / Search API Attachments as well.
Just an idea.
| Comment | File | Size | Author |
|---|---|---|---|
| #17 | Screenshot from 2023-02-05 12-03-35.png | 19.28 KB | kevinquillen |
| #10 | pararaphs_item_field_data.jpg | 197.39 KB | SomebodySysop |
| #10 | paragraphs_item_field_data.jpg | 197.9 KB | SomebodySysop |
| #10 | paragraphs_item.jpg | 117.26 KB | SomebodySysop |
| #10 | paragraph__field_content.jpg | 181.5 KB | SomebodySysop |
Issue fork openai-3337774
Comments
Comment #2
kevinquillen CreditAttribution: kevinquillen at Velir commented
This is actually a good idea worth looking into. It is this sort of application of OpenAI I am interested in - augmenting Drupal in assistive ways.
Comment #3
kevinquillen CreditAttribution: kevinquillen at Velir commented
Looking at the API, it does a POST request:
https://platform.openai.com/docs/api-reference/embeddings/create
My question is: how does that work, exactly? What does that mean, for example, for a site's content? I suppose all published content is public, generally speaking, but is there a risk for content that is access protected, unpublished, etc.? Can people create their own private models? If you submit to this API, is your content then in a public model?
For larger sites it might run up costs, so perhaps we start by limiting it to a small set of nodes, especially during development.
Comment #4
SomebodySysop CreditAttribution: SomebodySysop commented
What we are talking about is text vectorization, a numeric representation of the text. https://www.deepset.ai/blog/what-is-text-vectorization-in-nlp. Essentially, a way to make it easier for machines, specifically language models, to understand the text.
So, on a site we have a page with content. If it's a node, that content is stored in the Drupal database with a node id. If it's a paragraph, that paragraph is stored in the Drupal database with a paragraph id. When we run an embedding, we essentially create a vector representation of that page and store it in the database, referring back to its source id (i.e., the node/paragraph text that was vectorized).
So, for example, when a page is created or updated, we run the API to get the vectors for the page content, then store those vectors with the entity_id of the added/updated page. That table row is what we use to do the vector searches on page content (as opposed to the SQL keyword searches). We actually run a mathematical function to compare the search vector with the stored vectors, as opposed to a database search.
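To make the flow concrete, here is roughly what a single embedding call looks like with the openai-php/client library (the model choice and field handling are assumptions for illustration; the module would go through its own client factory):

```php
<?php

// Create a client with the site's API key.
$client = OpenAI::client(getenv('OPENAI_API_KEY'));

// Ask the embeddings endpoint to vectorize a field value.
$response = $client->embeddings()->create([
  'model' => 'text-embedding-ada-002',
  'input' => $node->get('body')->value,
]);

// One embedding per input; each is a flat array of floats that would be
// stored in the vector table alongside the entity_id.
$vector = $response->embeddings[0]->embedding;
```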
I don't see any change. End users have access to site pages based upon their access level. The vector content is never seen nor accessible by the user. It is only used for searching.
No. The page content and its access remain the same. The embedding process makes API calls to the OpenAI embedding model to retrieve the vectors. That information is not stored on the model. The model simply "vectorizes" the content and sends back the vectors. The page data is never in a position, as far as I can tell, to be seen or accessed outside of this process. Same for the vector information. Unless OpenAI is storing this information to do model training; that, I don't know. So far I've not been able to find anything on embedding security.
Update: I actually asked the ChatGPT3 module this question, and these are the responses I received: https://sharegpt.com/c/TF7wLOy
At this point, there aren't any custom embedding models. At least as far as I know. The available models are listed here: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
I know that people can create their own private models in fine-tuning, but that's not what I am suggesting here. I am suggesting that we use OpenAI embeddings to vectorize our content in order to perform vector searches that no longer rely on keyword matches to locate relevant data. This is one of the best explanations I have seen on the subject: https://www.youtube.com/watch?v=xzHhZh7F25I&ab_channel=PartTimeLarry
To be clear, I am not suggesting this process as a core feature. I would restrict it to an optional submodule of this OpenAI / GPT module. Therefore users who want this enhanced search feature on their data would certainly be willing to pay for the usage.
Which means that the submodule would require the user's API key. It would also give them the option of selecting the public embedding model they wish to use:
On this page, I only see one model: https://openai.com/api/pricing/
Ada $0.0004 / 1K tokens
Comment #5
kevinquillen CreditAttribution: kevinquillen at Velir commented
Thanks for the explanation, just wanted to throw some questions out there. I have an idea now. I may not be able to pivot to this for a few days.
Would you be able to contribute to the forked branch? https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding
Comment #6
kevinquillen CreditAttribution: kevinquillen at Velir commented
Yep, that's the goal for most of these features. I have created an openai_embeddings module in the forked branch. Open to 'better' naming scheme(s).
Looking through the docs, they combined a dataset into a single row per record in their example:
https://github.com/openai/openai-cookbook/blob/main/examples/Obtain_data...
In this case, should it only be entities like nodes, and not ones like Paragraphs? One reason I ask is that Paragraphs, last I checked, are not deleted, for example. Typically, there are no such things as orphaned paragraphs either; they (the active ones) are attached to a host. Should the node be loaded into a specific view mode that is just a textual representation of the node content? Then feed that to the API?
Comment #7
SomebodySysop CreditAttribution: SomebodySysop commented
I actually didn't know that about paragraphs. I knew you can migrate them without being attached to nodes, but not that they aren't deleted when their parent nodes are deleted.
I mentioned nodes and paragraphs because those are the primary entities that I am familiar with when working with site text content. Depending upon the application, a site content type could contain one or more body text fields, or none. In my sites, I usually place the content in paragraph body text fields. Other folks use nodes. So, our module would have to accommodate this.
For example, this field:
machine name: field_content
field type: text (formatted, long)
entity type: paragraph
entity id: paragraph id
parent entity: parent entity type (typically a node)
parent entity id: parent entity's id
Will be stored in the vector table like this:
vector entity type: node/paragraph
vector id: node/paragraph id
vectors: the actual vector array (stored as serialized blob, as per my understanding)
I'm not sure if we need the parent entity info. It depends upon how the search is structured -- how we retrieve the target entities for search. In SQL, we would join the vector table on the vector table id = paragraph/node id.
But, we don't care about the ids until AFTER the search (or in our case, mathematical function) is completed.
We run a dot product or cosine calculation between the search vector and all rows in the vector table. Then we take the highest matches (1, 2, 3, whatever the user decides), and THEN we retrieve the node/paragraph entities from those results based upon the node/paragraph ids.
Good question. Right now, in the Drupal field content tables, the text fields contain the raw HTML text content. The question is: Do we strip the HTML tags from the content before we embed it? Logically, it would seem like we should in order to get the best semantic search results, but it would take someone with more AI embedding knowledge than me to determine that.
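If we do decide to strip markup, a minimal sketch using Drupal core utilities (whether this actually improves embedding quality is the open question above):

```php
<?php

use Drupal\Core\Mail\MailFormatHelper;

// Reduce the raw HTML field value to plain text before embedding.
$raw = $paragraph->get('field_content')->value;
$plain = MailFormatHelper::htmlToText($raw);
// Or, more crudely: $plain = strip_tags($raw);
```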
Comment #8
SomebodySysop CreditAttribution: SomebodySysop commented
So, right now with the model we are discussing, our results will bring back full pages. I was thinking through how we might bring back an excerpt instead. Then, thinking beyond that to using something similar to a ChatBot interface.
Check out this video:
https://youtu.be/fODk-alDqWw
What the guy does here is:
So now the Drupal user can get ChatGPT3 responses to queries on his/her content without having to fine-tune an OpenAI model.
All done through normal Drupal database transactions and APIs to OpenAI.
Add Audio to text and text to Audio, and now we have a full Drupal bot.
Comment #9
kevinquillen CreditAttribution: kevinquillen at Velir commented
Would this have a performance impact during search at all? I was looking to see if Drupal 10 had support for JSON storage types in the database but could not confirm that yet.
In either case, yeah, I think the storage will look something like:
- id (entry id)
- entity_id
- entity_type
- bundle (unsure if needed)
- field_name (what the vector was generated from)
- embedding (the vectors)
- data (raw response from OpenAI, for debug or logging purposes)
with db indexes on type, bundle, field_name. Ex. a node with 4 text (plain or long) fields would have 4 entries. What about field deltas? Do we need to track ordering?
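As a starting point, that storage could be declared with hook_schema() along these lines (a sketch; names and types are not final):

```php
<?php

/**
 * Implements hook_schema().
 */
function openai_embeddings_schema() {
  $schema['openai_embeddings'] = [
    'description' => 'Stores embedding vectors for entity text fields.',
    'fields' => [
      'id' => ['type' => 'serial', 'not null' => TRUE],
      'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
      'entity_type' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
      'bundle' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
      'field_name' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
      'embedding' => ['type' => 'blob', 'size' => 'big', 'description' => 'The vectors.'],
      'data' => ['type' => 'blob', 'size' => 'big', 'description' => 'Raw API response.'],
    ],
    'primary key' => ['id'],
    'indexes' => [
      'entity_type' => ['entity_type'],
      'bundle' => ['bundle'],
      'field_name' => ['field_name'],
    ],
  ];
  return $schema;
}
```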
The administrative interface also needs a way for a user to opt in which entity types and their bundles to run this on. Would we also need to allow specifying which text fields to run on? Maybe we do them all initially and make opting fields in an enhancement later.
The first pass I will try is creating a job in the queue when a node is saved, and then work on a processor.
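A sketch of what that processor could look like (plugin ID and internals are assumptions, not committed code):

```php
<?php

namespace Drupal\openai_embeddings\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Generates and stores embeddings for queued entities during cron.
 *
 * @QueueWorker(
 *   id = "openai_embeddings",
 *   title = @Translation("OpenAI Embeddings"),
 *   cron = {"time" = 30}
 * )
 */
class EmbeddingQueueWorker extends QueueWorkerBase {

  /**
   * {@inheritdoc}
   */
  public function processItem($data) {
    $entity = \Drupal::entityTypeManager()
      ->getStorage($data['entity_type'])
      ->load($data['entity_id']);
    if ($entity) {
      // Call the embeddings API for each configured text field and
      // persist the vectors (see the storage sketch above).
    }
  }

}
```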
Comment #10
SomebodySysop CreditAttribution: SomebodySysop commented
Compared to current keyword searches? Probably so. We've got to do the API call to get the search vector, then the mathematical computation on the vector table -- and possibly a 2nd API call to get a response with context. We won't know just how much of a noticeable impact until we start testing with real data.
Yes. I don't know that field deltas matter in our scenario.
As for ordering, remember that the vector computation (dot product, cosine similarity, etc.) will return vector records that we will order according to their value relative to the search vector.
I think we need to think of it from the standpoint of what we need to get back to the entity we are actually going to display to the end-user. Let's take my case, where all my text content resides in paragraphs.
When we embed the content, we use the field_content_value (text) from the paragraph__field_content table. Now paragraph entity 11294 has a vector record and id.
When we search, and paragraph vector 11294 ends up the highest result, we use the paragraphs_item_field_data table to point us to the parent entity node that we will link users to in the search results.
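Resolving a matched paragraph back to its host node is straightforward, assuming the Paragraphs module's getParentEntity() (a sketch; the ID is from the example above):

```php
<?php

// Load the paragraph whose vector scored highest.
$paragraph = \Drupal::entityTypeManager()
  ->getStorage('paragraph')
  ->load(11294);

// Walk up to the host entity and link to it in the search results.
$parent = $paragraph->getParentEntity();
if ($parent && $parent->getEntityTypeId() === 'node') {
  $url = $parent->toUrl()->toString();
}
```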
In any event, I say let the end-user decide which fields to use with the warning that the more fields that are used, the greater the potential latency.
Comment #11
kevinquillen CreditAttribution: kevinquillen at Velir commented
So I started with a basic node type that has a title and a body. For the title, I named it 'Test Article'. The embedding returned for that was 1536 numbers - should it be that many?
Comment #12
SomebodySysop CreditAttribution: SomebodySysop commented
Yes. The vector arrays will be very long.
Comment #14
kevinquillen CreditAttribution: kevinquillen at Velir commented
Gotcha. I've put some rough work in here so far:
https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding/m...
I also added the field delta, for no other reason than to be able to use a db merge query. Wasn't sure of more efficient ways to ensure the data in the table is not duplicated and stays fresh. Thinking ahead to something like Paragraphs.
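For reference, the merge query in question would look something like this (table and column names follow the schema sketch above; the key set makes a re-save refresh the row instead of duplicating it):

```php
<?php

\Drupal::database()->merge('openai_embeddings')
  ->keys([
    'entity_type' => $entity->getEntityTypeId(),
    'entity_id' => $entity->id(),
    'field_name' => $field_name,
    'field_delta' => $delta,
  ])
  ->fields([
    'embedding' => json_encode($vector),
  ])
  ->execute();
```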
Would it also be possible to make the module able to suggest taxonomy terms (either existing or not) to classify the text data?
It may be possible to ditch serialized array storage for a JSON document. Unsure of where 10.x core lands, but would be much more performant for data like this.
https://www.drupal.org/project/drupal/issues/3276818
Comment #15
SomebodySysop CreditAttribution: SomebodySysop commented
Yes. Embed and store the taxonomy terms in the same way, then run a dot product / cosine similarity function with the search vector. Same principle. Pick the highest results.
By the way, I'm not sure which is better: dot product or cosine similarity or some other function. Most of the examples I've seen use cosine, but it was recommended to me by someone knowledgeable to use dot product. Guess we'll see.
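For what it's worth, cosine similarity is just the dot product scaled by the vector lengths:

```latex
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```

As I understand it, OpenAI's embeddings are normalized to unit length, so for ranking purposes the two measures should produce the same order.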
Comment #16
kevinquillen CreditAttribution: kevinquillen at Velir commented
Since we are building for Drupal 10, it looks like we are now permitted to define schema types as JSON (updated minimum database versions for Drupal 10):
This means values are now stored as JSON documents in the database:
This should prove fruitful when performing lookups or searches vs storing as serialized text. This may also mean having to maintain a query service for MySQL and one for PostgreSQL (they vary slightly in this area), but I think it's still worth doing.
I know it's a small sample set, but this query executed locally in 1ms for me, for example:
This, I believe, means we can query into the data without first extracting records, unserializing them, and then inspecting that data.
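A hedged example of what such a lookup might look like through Drupal's database layer (the JSON path assumes the vectors are stored under a "data" key, which is purely illustrative):

```php
<?php

// Pull one element out of a JSON column without unserializing in PHP.
$result = \Drupal::database()->query(
  "SELECT entity_id, JSON_EXTRACT(embedding, '\$.data[0]') AS first_value
   FROM {openai_embeddings}
   WHERE entity_type = :type",
  [':type' => 'node']
)->fetchAll();
```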
Comment #17
kevinquillen CreditAttribution: kevinquillen at Velir commented
(Screenshot attached; see the file list above.)
Comment #18
kevinquillen CreditAttribution: kevinquillen at Velir commented
It may be useful to add an "Embeddings" tab on a node/entity too and list out records in a table display so an admin can see what they have at a glance. This may be useful for monitoring token usage among other things.
Comment #19
kevinquillen CreditAttribution: kevinquillen at Velir commented
Here is another example, looking for a specific value (or could be a set of values):

```sql
SELECT JSON_EXISTS('{"data":{0.004663495}}', '$.data') FROM openai_embeddings oe
```

Comment #20
kevinquillen CreditAttribution: kevinquillen at Velir commented
A couple of things here.
Calculating cosine similarity in MySQL seems to require creating MySQL stored procedures (pretty much every example I saw did). Many (Drupal) hosts disallow this. The next thing I looked for was a PHP implementation. I adapted this into Drupal:
https://github.com/mlwmlw/php-cosine-similarity
To test this, I wrote some example code:
I am not entirely sure I am doing that correctly. The value of $compare is a vector generated previously from the "Test Article" string. The result of ::cosine was 0.49932475388291.
I guess the issue I see here is how quickly every vector in the database can be iterated over in order to do this. I am imagining cases where there are thousands of entries; each one has to be pulled and run through this function and then sorted highest to lowest, for example.
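Spelled out, the brute-force version of that loop would be roughly the following (reusing the cosine helper sketched earlier in the thread; this is O(rows x dimensions) per search, which is exactly the scaling worry):

```php
<?php

// Score every stored vector against the query vector and keep the top 5.
$rows = \Drupal::database()
  ->query('SELECT entity_type, entity_id, embedding FROM {openai_embeddings}')
  ->fetchAll();

$scores = [];
foreach ($rows as $row) {
  $scores[] = [
    'entity_type' => $row->entity_type,
    'entity_id' => $row->entity_id,
    'score' => openai_embeddings_cosine($query_vector, json_decode($row->embedding, TRUE)),
  ];
}
usort($scores, fn ($a, $b) => $b['score'] <=> $a['score']);
$top = array_slice($scores, 0, 5);
```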
I can see they recommend a few engines for vector searching; the only one I recognize is Redis.
If I rework that a little, it would look more like this:
resulting in
Comment #21
kevinquillen CreditAttribution: kevinquillen at Velir commented
There is also this, but it seems abandoned.
https://github.com/angeloskath/php-nlp-tools
Here is a newer looking lib:
https://github.com/yooper/php-text-analysis
Comment #22
SomebodySysop CreditAttribution: SomebodySysop commented
The first examples I got from ChatGPT were SQL computations:
You're right. This streamlines the process of computing similarities.
This is definitely outside my knowledge base, but I did get this from an expert on the subject a few weeks ago:
This is a question for ChatGPT.
Comment #23
SomebodySysop CreditAttribution: SomebodySysop commented
I just checked with ChatGPT:
Comment #24
kevinquillen CreditAttribution: kevinquillen at Velir commented
OpenAI themselves seem to say that the cosine similarity is good enough:
Little outside my wheelhouse there, though.
Comment #25
kevinquillen CreditAttribution: kevinquillen at Velir commented
Hmmm... I better slow down.

```
OpenAI\Exceptions\ErrorException: Rate limit reached for default-global-with-image-limits in organization org-xxxx on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method. in OpenAI\Transporters\HttpTransporter->requestObject()
```

Comment #26
SomebodySysop CreditAttribution: SomebodySysop commented
Yeah, I just watched a video on this recently. I can't remember how the guy worked around it.
But, this is something end-users will need to be aware of as well. This search capability is not something they will want to randomly make public to anybody -- unless they are a fairly large organization that can afford the tokens.
I can verify this. Early on I tested cosine similarity and dot product, and the results were pretty similar. Too bad we can't do it in SQL!
Comment #27
kevinquillen CreditAttribution: kevinquillen at Velir commented

Comment #28
kevinquillen CreditAttribution: kevinquillen at Velir commented
Given those recent items, I think we should split this into additional tickets.
I think, right now, we have set the foundation. We can store embeddings from text-based fields on any entity (the user entity type is blocked until we can write code to ensure things like passwords or PII data are not included, if user entities are handled at all). From here, figuring out how to compare them effectively with broadly available technology (PHP or MySQL) may take some time and effort. But at the very least, we are collecting data.
Comment #29
kevinquillen CreditAttribution: kevinquillen at Velir commented
OpenAI states that the model we are using will have 1536 output dimensions.
See:
for examples... we need to identify a valid, working PHP library to help compare these vectors - or if a MySQL stored procedure can do the same. Most of the library examples I see are for Python. If we cannot locate one, we may have to make our own.
One of the libs I linked above for example: https://github.com/yooper/php-text-analysis/blob/master/src/Comparisons/...
This says it is a cosine similarity comparison, but it does not use two vectors of floats. Perhaps it does do that - but the arguments passed to it have to be strings, not arrays of numbers. I can't find too much else out there. If we do make our own implementation, it will need a good battery of unit tests to ensure accuracy.
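The accuracy tests could start from hand-checkable vectors, something like this PHPUnit sketch (class name is hypothetical; it assumes the cosine helper sketched earlier in the thread):

```php
<?php

namespace Drupal\Tests\openai_embeddings\Unit;

use Drupal\Tests\UnitTestCase;

/**
 * Verifies the cosine similarity helper against known values.
 */
class CosineSimilarityTest extends UnitTestCase {

  public function testKnownValues(): void {
    // Identical vectors point the same way: similarity 1.0.
    $this->assertEqualsWithDelta(1.0, openai_embeddings_cosine([1, 2, 3], [1, 2, 3]), 1e-9);
    // Orthogonal vectors: similarity 0.0.
    $this->assertEqualsWithDelta(0.0, openai_embeddings_cosine([1, 0], [0, 1]), 1e-9);
    // Opposite vectors: similarity -1.0.
    $this->assertEqualsWithDelta(-1.0, openai_embeddings_cosine([1, 1], [-1, -1]), 1e-9);
  }

}
```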
Comment #31
kevinquillen CreditAttribution: kevinquillen at Velir commented
Merging current effort.
Comment #32
kevinquillen CreditAttribution: kevinquillen at Velir commented
Based on our previous comments and #29, I think we should create individual tickets that are the required pieces to get to the end goal.
Comment #33
SomebodySysop CreditAttribution: SomebodySysop commented
Did you see these?
Anyway, great work! Way more than I anticipated.
How do we do this? Since you now have the data collection methodology working, I'd like to add file attachments to that. I've used the SolrClient library in the past to make calls directly to the Solr engine. My idea would be to get the Solr file dataset vectors and add them to the same vector table used for nodes and paragraphs. Then searches would retrieve not only site content but also attachment (pdf, docx, txt, etc.) content. I can work on that.
Comment #34
kevinquillen CreditAttribution: kevinquillen at Velir commented
Yeah, I noted the first one (https://github.com/mlwmlw/php-cosine-similarity) and questions here:
https://www.drupal.org/project/openai/issues/3339686
The second one looks like it has not been touched in years, which may not be an issue if it 'works'. Will check through the shared GH repos.
Comment #35
kevinquillen CreditAttribution: kevinquillen at Velir commented
In my head, I am figuring that once we figure out the cosine similarity / score generation, this opens up Views, taxonomy, etc. features.
Comment #36
SomebodySysop CreditAttribution: SomebodySysop commented
Did you see this one?
https://github.com/openai-php/client
Comment #37
kevinquillen CreditAttribution: kevinquillen at Velir commented
Yeah - that's what the parent module uses to communicate with OpenAI, and it provides all the functionality we are leveraging to date: https://git.drupalcode.org/project/openai/-/blob/1.0.x/src/Http/ClientFa...
I think, ultimately, we are looking at something like this now that I know a little bit more:
https://www.drupal.org/project/openai/issues/3339686#comment-14911421
Comment #38
kevinquillen CreditAttribution: kevinquillen at Velir commented
Hop on over to this issue, I think we are on the right track: https://www.drupal.org/project/openai/issues/3339686#comment-14911959