[Meta] Conduct evaluations of the AI Agents
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3487016. -->
Reported by: [yautja_cetanu](https://www.drupal.org/user/626050)
>>>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>The AI Agents demoed at Barcelona worked for the demo but didn't work 100% of the time. We need real statistics and tests showing how these AI Agents perform for the marketer persona or the ambitious site builder. This can help us improve the quality of the agents and also help us demonstrate their effectiveness to the wider community.</p>
<p>This is blocked by the Evaluations module being ready: <span class="drupalorg-gitlab-issue-link project-issue-status-info project-issue-status-1"><a href="https://www.drupal.org/project/ai_evaluations/issues/3487007" title="Status: Active">#3487007: [Meta] Create an alpha version of evaluations used to test Drupal CMS</a></span></p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>We conduct a number of tests.</p>
<ul>
<li>We create a script with a number of scenarios with associated tasks for the end-user to complete using Drupal. This will be used to conduct the tests.</li>
<li>We create a test site that is standard Drupal CMS with the AI module and the Evaluations module/recipe used for recording the tests.</li>
<li>We make sure the tester knows that anything they type will be anonymously logged, recorded and may be sent to a central permanent place for analysis and building a library of successful prompts.</li>
<li>We ask them to log in using the provided credentials.</li>
<li>We give them a URL to start with, with the chatbot open (a later test can test it end-to-end, but for now it's focused on the chatbot itself and its ability to answer prompts).</li>
<li>We ask the tester to go through the tasks in the scenario and evaluate each one, ticking the green thumbs up when it works well and the red thumbs down when it works badly.</li>
<li>We conduct a short interview at the end, asking whether they think it worked well, whether they liked using it and found it helpful, and whether they have any other comments.</li>
<li>We gain consent again for exporting all of their prompts and allow them to see the prompts before we send off a CSV of the full evaluation.</li>
<li>We import this into a site that can run analytics across all the evaluations to report on what went well and what didn't.</li>
</ul>
<p>We conduct tests in three phases:</p>
<ul>
<li>Phase 1: Initial exploratory test with a single person to see that the script works and the evaluations software works.</li>
<li>Phase 2: Main controlled test with selected people, ideally in person; if not, done online with shared screens.</li>
<li>Phase 3: Wider test that allows anyone online to try it out and submit evaluations. We will have to trust participants' own statements about whether they fit the persona.</li>
</ul>
<p>Phase 2 and Phase 3 results will likely need to be kept separate.</p>
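<p>As a rough sketch of the analytics step, the thumbs-up rate could be computed per phase and per task from the exported CSV, which keeps the phases apart. This assumes one row per rated prompt with hypothetical <code>phase</code>, <code>task</code> and <code>rating</code> columns; the actual export format of the Evaluations module may well differ.</p>

```python
import csv
import io
from collections import defaultdict

# Hypothetical export format: one row per rated prompt. The column names
# (phase, task, rating) are assumptions, not the Evaluations module's schema.
SAMPLE_CSV = """phase,task,rating
2,task_1,up
2,task_1,down
2,task_2,up
3,task_1,up
3,task_2,down
3,task_2,down
"""


def thumbs_up_rate(csv_text):
    """Return {(phase, task): fraction of 'up' ratings}, keeping phases separate."""
    counts = defaultdict(lambda: [0, 0])  # (phase, task) -> [ups, total]
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["phase"], row["task"])
        counts[key][0] += row["rating"] == "up"
        counts[key][1] += 1
    return {key: ups / total for key, (ups, total) in counts.items()}


rates = thumbs_up_rate(SAMPLE_CSV)
for (phase, task), rate in sorted(rates.items()):
    print(f"Phase {phase} {task}: {rate:.0%} thumbs up")
```

<p>Grouping on the phase as well as the task means the controlled Phase 2 results are never averaged together with the self-selected Phase 3 results.</p>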
<p><strong>Current Questions/Script:</strong></p>
<p>Testing goals:</p>
<ul>
<li>To understand how participants naturally expect to use the AI assistant in the context of supporting them to achieve given tasks </li>
<li>To understand how participants rate the success of the AI assistant supporting them to achieve given tasks </li>
<li>To use the data collected by the Evaluations module to measure success </li>
</ul>
<p>Potential tasks that the user could perform with the AI assistant: </p>
<ul>
<li>Create a recipe content type</li>
<li>Add image field to an existing content type </li>
</ul>
<p><strong>Potential scenario (with associated tasks):</strong> You run a community group and you’ve taken the first steps to set up a website to keep people informed about what you do. </p>
<p><strong>Task 1: </strong></p>
<p>So far you can publish textual news content on your site, but you now need to be able to add images to the news items you publish. How would you use the AI assistant to help you make sure you can add images to future news articles you publish? </p>
<p>Expected result:<br>
Participants type in a query along the lines of how to add image fields to content types (using their own words) </p>
<p>Follow up:<br>
How successful would you say the AI assistant was in helping you achieve your task?<br>
Expected result:<br>
They give the AI a thumbs up/down based on the help it provided </p>
<p><strong>Task 2: </strong><br>
In addition to the news articles you want to be able to publish longer-form pieces which give information about collaborative community projects you’ve worked on, to include things like logos and links to the partners you have worked with. How would you use the AI assistant to help you do this? </p>
<p>Expected result:<br>
Participants type in a query along the lines of how to add a project content type (using their own words) </p>
<p>Follow up:<br>
How successful would you say the AI assistant was in helping you achieve your task? </p>
<p>Expected result:<br>
They give the AI a thumbs up/down based on the help it provided </p>
> Related issue: [Issue #3487007](https://www.drupal.org/node/3487007)
> Related issue: [Issue #3467680](https://www.drupal.org/node/3467680)