Problem/Motivation
The AI Agents demoed at Barcelona worked for the demo but didn't work 100% of the time. We need real statistics and tests of how these AI Agents would work with the marketeer persona or the ambitious sitebuilder. This can help us improve the quality of the agents but also help us demonstrate the effectiveness of the Agents to the wider community.
This is blocked by the Evaluations module being ready: #3487007: [Meta] Create an alpha version of evaluations used to test Drupal CMS
Proposed resolution
We conduct a number of tests.
- We create a script with a number of scenarios with associated tasks for the end-user to complete using Drupal. This will be used to conduct the tests.
- We create a test site that is standard Drupal CMS with the AI module and the Evaluations module/recipe used for recording the tests.
- We make sure the tester knows that anything they type will be anonymously logged, recorded and may be sent to a central permanent place for analysis and building a library of successful prompts.
- We ask them to log in using the provided credentials.
- We give them a URL to start with, with the chatbot open. (A later test can test it end-to-end, but for now it's focused on the chatbot itself and its ability to answer prompts.)
- We ask the tester to go through the tasks in the scenario and evaluate each one, ticking the green thumbs up when it works well and the red thumbs down when it works badly.
- We conduct a short interview at the end asking them if they think it worked well, if they liked using it and found it helpful and any other comments.
- We gain consent again for exporting all of their prompts and allow them to see it before we send off a CSV of the full evaluation.
- We import this into a site that can run analytics across all the evaluations to report on what went well and what didn't.
We conduct tests in three phases:
- Phase 1: Initial exploratory test with a single person to see that the script works and the evaluations software works.
- Phase 2: Main controlled test with selected people, ideally in person; if not, online with shared screens.
- Phase 3: Wider test that allows anyone online to try it out and submit evaluations. We will have to rely on trust to know whether they fit the persona.
Results from Phase 2 and Phase 3 will likely need to be kept separate.
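As a rough illustration of the analytics step and of keeping Phase 2 and Phase 3 results apart, here is a minimal sketch that computes the thumbs-up rate per phase and task from an exported CSV. The column names (`phase`, `task`, `rating`), the task identifiers and the sample rows are all assumptions made for this example; the real Evaluations module export may look quite different.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export format: the real Evaluations module CSV may use
# different column names and values. The rows below are made-up sample data.
SAMPLE_CSV = """phase,task,rating
2,add_image_field,up
2,add_image_field,down
2,create_project_type,up
3,add_image_field,up
3,create_project_type,down
3,create_project_type,down
"""


def thumbs_up_rates(csv_text):
    """Return {(phase, task): share of thumbs-up ratings}."""
    counts = defaultdict(lambda: [0, 0])  # [thumbs up, total ratings]
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["phase"], row["task"])
        counts[key][1] += 1
        if row["rating"] == "up":
            counts[key][0] += 1
    return {key: ups / total for key, (ups, total) in counts.items()}


rates = thumbs_up_rates(SAMPLE_CSV)
for (phase, task), rate in sorted(rates.items()):
    print(f"Phase {phase} | {task}: {rate:.0%} thumbs up")
```

Grouping by phase first means the controlled Phase 2 results never get averaged together with the open Phase 3 submissions, where we cannot verify the persona.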
Current Questions/Script:
Testing goals:
- To understand how participants naturally expect to use the AI assistant in the context of supporting them to achieve given tasks
- To understand how participants rate the success of the AI assistant supporting them to achieve given tasks
- To use data enabled by the evaluations module to make success measurements
Potential tasks that the user could perform with the AI assistant:
- Create a recipe content type
- Add image field to an existing content type
Potential scenario (with associated tasks): You run a community group and you’ve taken the first steps to set up a website to keep people informed about what you do.
Task 1:
So far you can publish textual news content on your site, but you now need to be able to add images to the news items you publish. How would you use the AI assistant to help you make sure you can add images to future news articles you publish?
Expected result:
Participants type in a query along the lines of how to add image fields to content types (using their own words)
Follow up:
How successful would you say the AI assistant was in helping you achieve your task?
Expected result:
They give the AI a thumbs up/down based on the help it provided
Task 2:
In addition to the news articles you want to be able to publish longer-form pieces which give information about collaborative community projects you’ve worked on, to include things like logos and links to the partners you have worked with. How would you use the AI assistant to help you do this?
Expected result:
Participants type in a query along the lines of how to add a project content type (using their own words)
Follow up:
How successful would you say the AI assistant was in helping you achieve your task?
Expected result:
They give the AI a thumbs up/down based on the help it provided
Comments
Comment #7
yautja_cetanu commented:
We've had a proper test of the AI Agents and they are mostly working. Most of the issues were problems with the Drupal UI itself, with testers being confused by Drupal, or with people being confused by the goals of the test. We tried to make this one very simple, with specific tasks to do.
In the next one we will try asking them to be more creative, for example to create a recipe page, so they can explore what fields they want instead of us telling them.
One major issue is the level of preview/review steps. Generally I'm seeing that Drupal developers are happier when the preview steps are more accurate, but those tend to confuse other people because they are told a whole bunch of Drupal terms before they get to see what the result looks like. The review steps are likewise nice for Drupal developers, as they give you links to all the places where something has been created, but for someone with no Drupal experience this was scary: they were worried that if they didn't write down those links they would be gone for good.
But it's nice that we are getting to a stage where improvements will come from working on the prompts and flow rather than from fixing bugs in code.
We've also had developers try them out, and they found it hard to get everything installed. So we will work on that next week, including improved documentation. I think we'll make a video to show people how to install the recipe and get started.
Further write-up to follow.
Comment #8
emma horrell commented:
Write-up of the tests is available in these blog posts:
https://blogs.ed.ac.uk/website-communications/category/drupal-ai/
Comment #9
pameeela commented:
Looks like this was done, so marking this fixed.