AI Shows Me What Success Looks Like And Then Doesn't

by John Sharpe
October 10, 2024

This is my third rewrite of this blog post – same content but I’ve completely shifted the focus each time. Broadly, it’s about a hackathon – 2 developers, 8 hours and 1 idea. Slightly less broadly, my colleague Greg and I had been chatting over an idea for a while, had spent a lunchtime refining it on a whiteboard and, finally, took a day out to dive into it and see how far we could get.

I’m finding this hard to write about because it was a really full day. There are a few points about which I could churn out a couple hundred words:

  • Starting a project from scratch and our technology choices
    There might be some content there but it’s a bit dry. Every developer has probably fought this battle multiple times, many have written about it. I don’t think I could make a good enough contribution on this front. The answer was Python, by the way. Just Python. A couple of libraries came into play but we always went back to standard-library Python.
  • The experience
    I don’t think I’ve spent so much of a workday giggling. Of course, I can’t write about this. But I will comment that it was an absolute blast from start to finish. Obviously it’s nice to have a break from regular work, but the nature of the project meant the feedback loop between writing something and trying it out was tiny. Also, learning to use AI tools was generally fun. I can think of three reasons for this:
    • It’s an English language interface, so there’s no barrier.
    • It’s non-deterministic to an enormous degree and the reasons for each result can’t be readily understood or predicted (a.k.a. magic).
    • There’s so much hype around AI that using it feels… important, relevant.
      I think this all filters through in the following text.
  • AI’s actual performance
    I’m going to touch on this one a lot, but I don’t care to focus on it in any formal sense. Honestly, I don’t know how to judge the quality of the results I got back. I’d imagine it’s a complicated topic which combines statistics and psychology in tricky ways.

So what am I writing about? I think what I’m going for is how using AI felt – both to help develop the application (ChatGPT) and as a part of the application (OpenAI Assistants). I’m happy to see bad results, but were they bad for good reasons? Would I see the mistake and adjust the prompt accordingly, or would I struggle to move forward? I’m personally just starting out here – can I teach myself how to get the best results organically?

ChatGPT’s usefulness was extremely variable. At times, I was surprised at how sensible the results were, at other times disappointed that it seemed not to “want” to help. Hopefully this range is demonstrated by the following examples. Using an OpenAI Assistant, on the other hand, was quite a success! I’ll come back to this at the end.

First, a bit more background to the hackathon

AI might be the most important abbreviation of the 2020s and possibly the next bajillion years. Ever keen to stay ahead of the curve, a considerable amount of OpenCredo attention and effort has been spent pondering this subject and looking for ways to demonstrate our technical expertise. A promising internal project is coming together but its scope is far too broad to see tangible results in the short-term.

I’m interested in Cold War history and came up with an idea for a simple, text-based game which uses Generative AI as a component – just a piece of the puzzle, not a jigsaw factory. Greg is far more educated, experienced and excited than I am when it comes to AI/ML and he was an obvious choice of confidant from the start. We spent a lunchtime building up a player’s story from start to “game over”.

After a passionate pitch to OC, we got the response “you can have a day to get this out of your system, then get back to work”.

The project

Spies!

Here’s the idea for the game: You are an MI6 spymaster working in East Berlin at the height of the Cold War. The KGB has recently dispatched an undercover operative to hunt down and eliminate Western intelligence agents, with staggering success. With a list of suspects, your job is to identify and assassinate this new threat. Use your agents to spy on people, track their movements and listen in to their conversations. From who they meet up with and talk to, and the clues they give about these people and about themselves, you can work through the suspect list and eventually find the KGB agent. Beware though, the KGB are closing in on your people too…

Behind the scenes, this looked like the following (a rough code sketch follows the list):

  1. Generate 50 “people” – classic German names, East Berlin addresses and jobs
  2. Build a graph out of these people, with 1 node per person, a bottom layer of 1 (a.k.a. the boss) and a lot of rules about how it’s connected
  3. The game is now ready to play. You have 5 agents, each “watching” a target in the population (randomly chosen to start with)
  4. The user is invited to interact at this point
    a. Generate! A period of time passes and this action gets a new batch of transcripts from all targets. This is the meat of the game. The plan was to use an Assistant to give us the transcripts. The rest was, hopefully, just a little bit of logic to stick it all together.
    b. Reassign! Set your agents to new targets
    c. Review! Re-read all transcripts about a single member of the population
    d. Kill! You think you’ve identified the boss? Assassinate someone and see what happens!
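
Stitched together, the skeleton we had in mind looked roughly like the sketch below. This is a hypothetical reconstruction rather than our actual code, and generate_transcript stands in for the Assistant call covered later:

# Hypothetical sketch of the game loop described above: 5 agents, each
# "watching" one member of the population, and the four player actions.
import random

def generate_transcript(target):
    return f"[a transcript about {target} would come back from the Assistant here]"

def play(population, boss, num_agents=5):
    watching = random.sample(population, num_agents)   # one target per agent, randomly chosen
    transcripts = {person: [] for person in population}

    while True:
        action = input("(g)enerate / (r)eassign / re(v)iew / (k)ill? ")
        if action == "g":
            for target in watching:
                transcripts.setdefault(target, []).append(generate_transcript(target))
        elif action == "r":
            agent = int(input(f"Agent to reassign (0-{num_agents - 1})? "))
            watching[agent] = input("New target? ")
        elif action == "v":
            person = input("Whose transcripts? ")
            print("\n".join(transcripts.get(person, ["Nothing on file."])))
        elif action == "k":
            victim = input("Who do you assassinate? ")
            print("You win!" if victim == boss else "Wrong person...")
            break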

Introduction to working alongside AI

PyCharm > File > New Project...

So far so good. Now to get 50 German names!

“This is not a hard problem” I thought, before heading to Google, Wikipedia etc. I found and copied a few names before I started huffing and puffing. 50 didn’t seem so small a number anymore. “Are you not using ChatGPT?”, asked Greg. I don’t have the instinct to see ChatGPT as a general-purpose solution. It’s often useful, especially when you’re writing software, to ask yourself “what am I trying to do?”. But it’s unfamiliar to me to take that answer out of my brain, from abstract thought to English language, and type it into a browser. I’m a little bit cynical, but I had to admit this was a perfect application! Here are the prompts I used and some comments on each response.

Give me a list of 50 traditional, German names.

The results were, to be generous, horrible. Sure, I got 50 names, but just first names. There were other issues too, and it came naturally to tackle these problems one by one.

Give me a list of 50 traditional, German full names.

OK, that’s a bit better, but many first names and last names were repeated*1. Sometimes the full name was repeated entirely.

Give me a list of 50 traditional, unique, German full names.

Frustrating. Full name repeats were removed, but first names and last names were often reused.

Try again, but make sure there are no repeated first names or last names.

Better, they were now fully unique! However, the gender split was poor; almost all of these names were male. Success? Well, there was a list which looked OK, but that’s the classic ChatGPT problem… They looked like German names. “Günther”, “Jürgen”, “Monika”, “Brigitte”, “Anneliese”, “Elke” etc. Definitely a healthy number of umlauts there. The software developer in me resisted the urge to say “we’re done here, let’s move on”, but that instinct was misplaced. I was only making a game: “looking good enough” was actually “good enough” in this case.
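
In hindsight, checking the output mechanically would have been quicker than re-prompting and eyeballing. A minimal sketch of the sort of check I mean, assuming the names were pasted into a plain list of “First Last” strings (the three names shown are just placeholders):

# Hypothetical sketch: flag any repeated first or last names in the pasted list.
from collections import Counter

names = ["Günther Schmidt", "Jürgen Becker", "Monika Weber"]  # ...the full 50

first_counts = Counter(name.split()[0] for name in names)
last_counts = Counter(name.split()[-1] for name in names)

print("Repeated first names:", [n for n, c in first_counts.items() if c > 1])
print("Repeated last names:", [n for n, c in last_counts.items() if c > 1])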

In the end, maybe 3 minutes had passed, so I’d definitely made a saving over manually compiling the list. I was fortunate to have that context though; I’m probably still too suspicious to use this tool in any professional capacity. I wonder if I’ll be saying the same thing in 3 months’ time. Notice that my approach becomes slightly more conversational after a few rounds – I have no idea if that helped but it felt like the right thing to do, especially when the responses all ended with fluff like “Here’s a list of 50 German names that have been in popular use throughout the 20th century.” Is ChatGPT tuned to expect this sort of thing? Does it perform differently when you expect it to “remember”*2 parts of the thread? All unclear. Weirdly, these English-language sentences make me more doubtful of the results. To me, they’re a reminder that this agent is trying to convince me to give positive feedback, rather than giving a truer answer that might not look so friendly.

I’d like to note that, once I’d decided to move on, I felt incredibly satisfied with the interaction. Rather than learning how to use a tool and leveraging it, it was like I had asked a colleague to do some work for me and they had already done something so similar I could grab the same results. In the same vein, the things I had to fix with each re-query felt reasonable.

No, please listen to me

With my list of 50 names banked, I turned my attention to addresses and started the same way. It struck me as a similar task.

Give me a list of 50 addresses in 1961 East Berlin

So far, not too bad. The format was a consistent <number> <road>, <district> e.g. 46 Warschauer Straße, Friedrichshain and the districts were correct (according to the Wikipedia page). At this point I had the same questions in mind – was there ever such a place? If there was a “Warschauer Straße”, did it have a 46th building? Was it in Friedrichshain? But this time I was more prepared to take something that felt right and not worry too much about absolute accuracy.

This list wasn’t quite ready yet though. I knew there were 11 districts and, though I was expecting more addresses to be in some than others due to size and housing density differences, I wanted them all to be represented.

Give me a list of 50 addresses in 1961 East Berlin, with some representation from all districts

Better, but only 5 districts were represented.

Try again, but make sure there is at least one address from each district

A similar set of results: the same 5 districts were represented in the same proportions – most addresses were in Friedrichshain, Mitte and Prenzlauer Berg.

Try again, but make sure there is at least one address from each of the 11 districts

A similar set of results. Maybe ChatGPT had “forgotten”*3 what it was retrying?

Give me a list of 50 addresses in 1961 East Berlin, with at least one address from each of the 11 districts

Ugh. A similar set of results. One address was in Marzahn.

Give me a list of 50 addresses in 1961 East Berlin, with at least one address from each of the 11 districts: Friedrichshain, Lichtenberg, Mitte, Prenzlauer Berg, Weißensee, Treptow, Köpenick, Pankow, Hellersdorf, Marzahn, Hohenschönhausen

A similar set of results. What is going on? Has ChatGPT fallen into some loop that I’m now trapped in (I’m not convinced that can even happen)? I opened a new tab and tried that last prompt and, yet again, got a response with the same problem. I don’t think I can do any better than this.

I gave it a good few reruns, slightly tweaking the wording as I went, but it just wouldn’t budge. Fine, time for 11 queries…

Give me a list of 5 addresses in Friedrichshain, East Berlin, 1961
Give me a list of 5 addresses in Lichtenberg, East Berlin, 1961
...
Give me a list of 5 addresses in Hohenschönhausen, East Berlin, 1961

This is how I eventually got a list together. Very disappointing, especially since I’d been wrestling with it for a good 20 minutes. All the satisfaction I had felt from generating names had died away. That last step where I put together a few sublists broke the spell – it was no longer somebody handing me completed work, it was just a different kind of work I had to do.
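
For what it’s worth, stitching the sublists back together was only a few lines of Python. A rough sketch, assuming each per-district response was pasted into its own list of street addresses (the data below is illustrative, not what ChatGPT actually returned):

# Hypothetical sketch: merge the per-district address lists into one list,
# tagging each address with its district, and report the coverage.
district_addresses = {
    "Friedrichshain": ["46 Warschauer Straße", "12 Boxhagener Straße"],
    "Lichtenberg": ["3 Frankfurter Allee"],
    # ...one entry per district, pasted from each response
}

addresses = [
    {"street": street, "district": district}
    for district, streets in district_addresses.items()
    for street in streets
]

print(f"{len(addresses)} addresses across {len(district_addresses)} districts")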

As I write this post though, looking back I can’t help but feel like I got some value from ChatGPT. Compiling the list by hand would’ve taken about the same amount of time, although I’d have felt more confident that each of the addresses had been real at that point in history. Perhaps I’d already developed a reliance and this was real life slapping me to wake me up. But at the same time, I didn’t have to dig around the internet and pick out addresses, reformatting each time. That’s tedious work. And the addresses I ended up with did capture the mood in the same way as the names. They sounded East German, 1961. On later review, I recognised some bits of them (e.g. Alexanderplatz), and for those I didn’t – a nice scattering of umlauts and eszetts.

Hard graphs

So at this stage, I was all square with ChatGPT. Names went well, addresses did not, and a few more tasks in between had middling results. Time to connect up the population as per step 2 in the behind-the-scenes list above. We deliberated about the graph we wanted and drew our intentions on the whiteboard, then sat down at a laptop to see if we could get ChatGPT to give us some Python to create it. This would be a fairly complex request. I’ve seen examples where the tool is asked a question and can’t give a correct answer, but when it’s asked to generate code to do that task, what you get really works. We tried many, many rounds of rewriting/expanding/paraphrasing the following:

Create a python function which, given a set of 50 nodes, connects them to make a graph of 4 layers. The first layer has 1 node. It is fully connected to the second node which has 5 nodes. All members of the second layer are connected with each other. There are some connections to the third layer, which has 14 nodes. There are many connections between the third layer and the fourth, which is made up of all remaining nodes.

Phew. That’s a lot of words and it’s not even everything we wanted out of the graph! Unsurprisingly, ChatGPT wasn’t quite up for it. It consistently returned Python code which created a graph, which is an incredible feat. It often suggested pulling in third-party libraries commonly used for graph work (and one which drew graphics), but these weren’t particularly helpful in this case. This was a stark reminder of the nature of LLM training: it’s not hard to imagine that a relatively high proportion of the training data using the words “python” and “graph” also includes the names of these libraries. We eventually gave up on trying to get ChatGPT to solve this problem for us, convinced that we just didn’t know how to ask for what we wanted. The graphs generated just weren’t connected sensibly. With about an hour’s work, we had handwritten enough Python to do everything we wanted, like cavemen.
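
For a flavour of the caveman code, here’s a stripped-down sketch of the layering step using only the standard library. It isn’t our actual implementation and it skips most of the connection rules we cared about:

# Hypothetical sketch: split 50 people into 4 layers (1 / 5 / 14 / the rest)
# and wire up a simple adjacency map.
import random

def build_graph(people):
    people = list(people)
    random.shuffle(people)
    layers = [people[:1], people[1:6], people[6:20], people[20:]]
    edges = {person: set() for person in people}

    def connect(a, b):
        edges[a].add(b)
        edges[b].add(a)

    # The boss is connected to everyone in the second layer,
    # and the second layer is fully connected with itself.
    boss = layers[0][0]
    for member in layers[1]:
        connect(boss, member)
    for a in layers[1]:
        for b in layers[1]:
            if a != b:
                connect(a, b)

    # Everyone in the third and fourth layers gets at least one
    # connection to somebody in the layer above.
    for upper, lower in ((layers[1], layers[2]), (layers[2], layers[3])):
        for person in lower:
            connect(person, random.choice(upper))

    return edges

Calling build_graph(names) with the 50 names gives a dictionary mapping each person to the set of people they’re connected to, which was enough structure for the game to walk.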

Spying on people

To carry the game, we might have written many pages of text or maybe “bits” of text and some logic to compile them for your agents’ surveillance reports. This would’ve taken us a huge amount of time and put a fixed “cap” on the quality of the product. Using an OpenAI Assistant, we were hoping to quickly put something together with massive entropy, so the game would be different and interesting every time it was played. It didn’t take long to get up-and-running and even the very first few attempts showed promise. Full credit goes to Greg here, who knew exactly what he was doing. Prompts came together with something like this:

prompt = f'''The target's details are as follows:
Name: {person["name"]},
Job: {person["job"]},
District: {person["address"]["district"]},
Connected to: {connections_sample}'''
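
For context, the code around that prompt looked roughly like the sketch below. This is a reconstruction from memory rather than our actual code, assuming the openai Python SDK and its beta Assistants endpoints; the instruction text here is abridged:

# Hypothetical sketch: ask the Assistant for one surveillance transcript.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assistant = client.beta.assistants.create(
    name="Spymaster",
    model="gpt-4o",
    instructions="You write surveillance transcripts set in 1961 East Berlin.",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content=prompt)  # 'prompt' is the f-string built above
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)  # blocks until the run finishes

messages = client.beta.threads.messages.list(thread_id=thread.id)
transcript = messages.data[0].content[0].text.value  # newest message first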

There were a few simple problems; for example, the reports would occasionally end with something like “Continued surveillance is recommended”. I can understand (or at least appreciate) why that sort of text was generated. However, we added some extra specification like “Do not make recommendations about future activity” and it instantly did the trick. Another slight issue was the mood. The characters all seemed quite happy with their lives, which was far from the gritty, grey tone we were going for. Again, a quick addition to the specification fixed that right away – we explicitly mentioned the tone and the results were a good, direct reflection of that:

The characters are all staunch communists. Make any conversations engaging and dark for a sombre, realistic tone.

One thing we couldn’t get away from was the ‘feel’ of an AI response, though I’m not entirely sure what that means. I suspect there are some measurable traits like sentence length, but there was something very slightly ‘samey’ about the transcripts. As before, I expect adding more detail to the Assistant specification would do the trick.
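
If I ever go back to test that hunch, a crude first measure might be something like this (purely illustrative, with made-up example text):

# Hypothetical sketch: compare average sentence length across a few transcripts
# as a crude 'samey-ness' measure.
import re
import statistics

def average_sentence_length(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return statistics.mean(len(s.split()) for s in sentences)

transcripts = ["The café was quiet. Müller arrived late.", "Nobody spoke. The rain kept falling."]
print([round(average_sentence_length(t), 1) for t in transcripts])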

In conclusion

All told, it’s clear that I’m not very good at writing prompts for ChatGPT. My first challenge is to stay aware that it might be a way to get to a solution. It’s always available if I’ve got an internet connection. From there, it feels like you have to ‘get to know’ the tool to foresee some problems. I look back at the “names” and I feel like I wouldn’t make the first-name/full-name mistake*4 four times in a row again. With both addresses and graphs, it does feel as though a little more practice would help me to ask the right questions and get the most value out of an amazing tool.

We were very lucky with our context. Maybe this is just how AI should be used at the moment: it’s just a game, so if the Assistant gives an insane response nobody will lose anything of value. At the same time, if the responses aren’t 100% accurate, as long as we maintain the mood no harm is done. If you’re looking for a way to get to grips with Generative AI, I’d go as far as recommending using it to make a game. If you keep focused on the bigger picture – things like tone and atmosphere – you can throw together something quite effective while learning how best to use these tools and getting a better appreciation for where they shine and where they are not (yet) so useful.

*1 Later on in the project, a few repeats might be desirable: “Which Karl is my target talking about?” – but for now we wanted the names to be absolutely unique. This was a conscious decision we made before starting.

*2 I shudder when using words like “remember” and “forget” with ChatGPT – it’s not a brain. But it is a convenient way to say the piece of information no longer has the demanded impact on response generation. I’ve never heard the same terms applied to caching but I suppose it might work.

*3 I did it again. You know what I mean.

*4 I notice I’m talking about it as though it’s an API… Maybe I didn’t make a “mistake” per se; both a “first name” and a “full name” are appropriate answers when asked for a “name”.
