The AIDocumentLibraryChat project has been extended to generate test code (Java code has been tested). The project can generate test code for publicly available GitHub projects. The URL of the class to test can be provided; the class is then loaded, its imports are analysed, and the dependent classes of the project are loaded as well. That gives the LLM the opportunity to consider the imported source classes while generating mocks for the tests. A testUrl can be provided to give the LLM an example test to base the generated test on. The granite-code, deepseek-coder-v2, and codestral models have been tested with Ollama.

The goal is to test how well the LLMs can help developers create tests.

Implementation

Configuration

To select the LLM model, the application-ollama.properties file needs to be updated:

spring.ai.ollama.base-url=${OLLAMA-BASE-URL:http://localhost:11434}
spring.ai.ollama.embedding.enabled=false
spring.ai.embedding.transformer.enabled=true
document-token-limit=150
embedding-token-limit=500
spring.liquibase.change-log=classpath:/dbchangelog/db.changelog-master-ollama.xml
...
# generate code
#spring.ai.ollama.chat.model=granite-code:20b
#spring.ai.ollama.chat.options.num-ctx=8192

spring.ai.ollama.chat.options.num-thread=8
spring.ai.ollama.chat.options.keep_alive=1s

#spring.ai.ollama.chat.model=deepseek-coder-v2:16b
#spring.ai.ollama.chat.options.num-ctx=65536

spring.ai.ollama.chat.model=codestral:22b
spring.ai.ollama.chat.options.num-ctx=32768

The ‘spring.ai.ollama.chat.model’ property selects the LLM code model to use.

The ‘spring.ai.ollama.chat.options.num-ctx’ property sets the number of tokens in the context window. The context window has to hold both the tokens of the request and the tokens of the response. For example, with ‘num-ctx=8192’ a request of roughly 6,000 tokens leaves only about 2,000 tokens for the generated response.

The ‘spring.ai.ollama.chat.options.num-thread’ property can be used if Ollama does not choose the right number of cores to use. The ‘spring.ai.ollama.chat.options.keep_alive’ property sets the number of seconds the context window is retained.

Controller

The interface to get the sources and to generate the test is the controller:

@RestController
@RequestMapping("rest/code-generation")
public class CodeGenerationController {
    private final CodeGenerationService codeGenerationService;

    public CodeGenerationController(CodeGenerationService codeGenerationService) {
        this.codeGenerationService = codeGenerationService;
    }

    @GetMapping("/test")
    public String getGenerateTests(@RequestParam("url") String url,
            @RequestParam(name = "testUrl", required = false) String testUrl) {
        return this.codeGenerationService.generateTest(URLDecoder.decode(url, StandardCharsets.UTF_8),
                Optional.ofNullable(testUrl).map(myValue -> URLDecoder.decode(myValue, StandardCharsets.UTF_8)));
    }

    @GetMapping("/sources")
    public GithubSources getSources(@RequestParam("url") String url,
            @RequestParam(name = "testUrl", required = false) String testUrl) {
        var sources = this.codeGenerationService.createTestSources(
                URLDecoder.decode(url, StandardCharsets.UTF_8), true);
        var test = Optional.ofNullable(testUrl)
                .map(myTestUrl -> this.codeGenerationService.createTestSources(
                        URLDecoder.decode(myTestUrl, StandardCharsets.UTF_8), false))
                .orElse(new GithubSource("none", "none", List.of(), List.of()));
        return new GithubSources(sources, test);
    }
}

The ‘CodeGenerationController’ has the method ‘getSources(…)’. It gets the ‘url’ of the class to generate tests for and optionally the ‘testUrl’ of the example test. It decodes the request parameters and calls the ‘createTestSources(…)’ method with them. The method returns the ‘GithubSources’ record with the sources of the class to test, its dependencies in the project, and the test example.

The method ‘getGenerateTests(…)’ gets the ‘url’ of the class under test and the optional ‘testUrl’, URL-decodes both, and calls the method ‘generateTest(…)’ of the ‘CodeGenerationService’.
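For example, the sources endpoint can be called with a request like this (assuming the application runs on localhost:8080; the class URL is taken from the conclusion below):

http://localhost:8080/rest/code-generation/sources?url=https://github.com/Angular2Guy/MovieManager/blob/master/backend/src/main/java/ch/xxx/moviemanager/adapter/controller/ActorController.java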

Service

The ‘CodeGenerationService’ collects the classes from GitHub and generates the test code for the class under test.

The service with the prompts looks like this:

@Service
public class CodeGenerationService {
    private static final Logger LOGGER = LoggerFactory.getLogger(CodeGenerationService.class);
    private final GithubClient githubClient;
    private final ChatClient chatClient;
    private final String ollamaPrompt = """
        You are an assistant to generate spring tests for the class under test.
        Analyse the classes provided and generate tests for all methods. Base your tests on the example.
        Generate and implement the test methods. Generate and implement complete tests methods.
        Generate the complete source of the test class.

        Generate tests for this class:
        {classToTest}

        Use these classes as context for the tests:
        {contextClasses}

        {testExample}
        """;
    private final String ollamaPrompt1 = """
        You are an assistant to generate a spring test class for the source class.
        1. Analyse the source class
        2. Analyse the context classes for the classes used by the source class
        3. Analyse the class in test example to base the code of the generated test class on it.
        4. Generate a test class for the source class and use the context classes as sources for creating the test class.
        5. Use the code of the test class as test example.
        6. Generate tests for each of the public methods of the source class.

        Generate the complete source code of the test class implementing the tests.

        {testExample}

        Use these context classes as extension for the source class:
        {contextClasses}

        Generate the complete source code of the test class implementing the tests.
        Generate tests for this source class:
        {classToTest}
        """;
    @Value("${spring.ai.ollama.chat.options.num-ctx:0}")
    private Long contextWindowSize;

    public CodeGenerationService(GithubClient githubClient, ChatClient chatClient) {
        this.githubClient = githubClient;
        this.chatClient = chatClient;
    }

This is the ‘CodeGenerationService’ with the ‘GithubClient’ and the ‘ChatClient’. The ‘GithubClient’ is used to load the sources from a publicly available repository, and the ‘ChatClient’ is the Spring AI interface to access the AI/LLM.
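The implementation of the ‘GithubClient’ is not shown in this article. A minimal sketch of such a client could look like this, assuming it fetches the raw file over HTTP and returns a record with the accessors that ‘createTestSources(…)’ uses (‘sourceName()’, ‘sourcePackage()’, ‘lines()’); the record name and the base URL value here are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class GithubClientSketch {
    // Assumption: the raw content host that GithubClient.GITHUB_BASE_URL points to.
    public static final String GITHUB_BASE_URL = "https://raw.githubusercontent.com";

    // Hypothetical result record with the accessors used by createTestSources(...).
    public record GithubSourceFile(String sourceName, String sourcePackage, List<String> lines) {}

    public GithubSourceFile readSourceFile(String url) throws Exception {
        // Fetch the raw source file as a string.
        var response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        var lines = response.body().lines().toList();
        // Derive the package from the 'package' declaration and the name from the URL.
        var sourcePackage = lines.stream().filter(myLine -> myLine.startsWith("package"))
                .map(myLine -> myLine.replace("package", "").replace(";", "").trim())
                .findFirst().orElse("");
        var sourceName = url.substring(url.lastIndexOf('/') + 1).replace(".java", "");
        return new GithubSourceFile(sourceName, sourcePackage, lines);
    }
}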

The ‘ollamaPrompt’ is the prompt for the IBM Granite LLM with a context window of 8k tokens. The ‘{classToTest}’ placeholder is replaced with the source code of the class under test. The ‘{contextClasses}’ placeholder can be replaced with the dependent classes of the class under test, and the ‘{testExample}’ placeholder is optional and can be replaced with a test class that serves as an example for the code generation.

The ‘ollamaPrompt1’ is the prompt for the Deepseek-Coder-V2 and Codestral LLMs. These LLMs can ‘understand’, or work with, a chain-of-thought prompt and have context windows of more than 32k tokens. The ‘{…}’ placeholders work the same as in the ‘ollamaPrompt’. The long context window enables the addition of the context classes for the code generation.

The ‘contextWindowSize’ property is injected by Spring and is used to check whether the context window of the LLM is big enough to add the ‘{contextClasses}’ to the prompt.
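To illustrate how the placeholders are filled in, here is a minimal sketch of Spring AI’s ‘PromptTemplate’ with a shortened, hypothetical template and hypothetical values (the real call happens in ‘generateTest(…)’ below):

import java.util.Map;

import org.springframework.ai.chat.prompt.PromptTemplate;

public class PromptTemplateSketch {
    public static void main(String[] args) {
        // Shortened template with the same placeholders as the service prompts.
        var template = """
            Generate tests for this source class:
            {classToTest}
            Use these context classes as extension for the source class:
            {contextClasses}
            {testExample}
            """;
        var message = new PromptTemplate(template, Map.of(
                "classToTest", "public class ActorService {}", // hypothetical source
                "contextClasses", "", // left empty for small context windows
                "testExample", ""))   // empty if no testUrl is provided
            .createMessage();
        System.out.println(message.getContent());
    }
}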

The method ‘createTestSources(…)’ collects and returns the sources for the AI/LLM prompts:

public GithubSource createTestSources(String url, final boolean referencedSources) {
    final var myUrl = url.replace("https://github.com", GithubClient.GITHUB_BASE_URL).replace("/blob", "");
    var result = this.githubClient.readSourceFile(myUrl);
    final var isComment = new AtomicBoolean(false);
    final var sourceLines = result.lines().stream()
        .map(myLine -> myLine.replaceAll("[\t]", "").trim())
        .filter(myLine -> !myLine.isBlank())
        .filter(myLine -> filterComments(isComment, myLine)).toList();
    final var basePackage = List.of(result.sourcePackage().split("\\.")).stream().limit(2)
        .collect(Collectors.joining("."));
    final var dependencies = this.createDependencies(referencedSources, myUrl, sourceLines, basePackage);
    return new GithubSource(result.sourceName(), result.sourcePackage(), sourceLines, dependencies);
}

private List<GithubSource> createDependencies(final boolean referencedSources, final String myUrl,
        final List<String> sourceLines, final String basePackage) {
    return sourceLines.stream().filter(x -> referencedSources)
        .filter(myLine -> myLine.contains("import"))
        .filter(myLine -> myLine.contains(basePackage))
        .map(myLine -> String.format("%s%s%s",
            myUrl.split(basePackage.replace(".", "/"))[0].trim(),
            myLine.split("import")[1].split(";")[0].replaceAll("\\.", "/").trim(),
            myUrl.substring(myUrl.lastIndexOf('.'))))
        .map(myLine -> this.createTestSources(myLine, false)).toList();
}

private boolean filterComments(AtomicBoolean isComment, String myLine) {
    var result1 = true;
    if (myLine.contains("/*") || isComment.get()) {
        isComment.set(true);
        result1 = false;
    }
    if (myLine.contains("*/")) {
        isComment.set(false);
        result1 = false;
    }
    result1 = result1 && !myLine.trim().startsWith("//");
    return result1;
}

The method ‘createTestSources(…)’ provides the ‘GithubSource’ records with the source code of the GitHub ‘url’ and, depending on the value of ‘referencedSources’, the sources of the dependent classes in the project.

To do that, the ‘myUrl’ is created to get the raw source code of the class. Then the ‘githubClient’ is used to read the source file as a string. The source string is turned into source lines without formatting, and the comments are removed with the method ‘filterComments(…)’.
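For illustration, the comment filtering can be tried standalone like this (the logic is the same as in ‘filterComments(…)’; the input lines are hypothetical):

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class FilterCommentsSketch {
    // Same logic as CodeGenerationService.filterComments(...): the AtomicBoolean
    // carries the 'inside a block comment' state from line to line.
    static boolean filterComments(AtomicBoolean isComment, String myLine) {
        var result = true;
        if (myLine.contains("/*") || isComment.get()) {
            isComment.set(true);
            result = false;
        }
        if (myLine.contains("*/")) {
            isComment.set(false);
            result = false;
        }
        return result && !myLine.trim().startsWith("//");
    }

    public static void main(String[] args) {
        var isComment = new AtomicBoolean(false);
        var lines = List.of("/* a block", "comment */", "var a = 1;", "// a line comment");
        // Prints only: var a = 1;
        lines.stream().filter(myLine -> filterComments(isComment, myLine)).forEach(System.out::println);
    }
}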

To read the dependent classes of the project, the base package is used. For example, for the package ‘ch.xxx.aidoclibchat.usecase.service’ the base package is ‘ch.xxx’. The method ‘createDependencies(…)’ creates the ‘GithubSource’ records for the dependent classes in the base package. The ‘basePackage’ parameter is used to filter the import lines, and then the method ‘createTestSources(…)’ is called recursively with the parameter ‘referencedSources’ set to false to stop the recursion. That is how the ‘GithubSource’ records of the dependent classes are created.
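To make the import-to-URL mapping concrete, here is a standalone sketch with a hypothetical import line and a raw source URL in the style of the MovieManager examples from the conclusion below:

public class DependencyUrlSketch {
    public static void main(String[] args) {
        // Hypothetical raw URL of the class under test and one of its import lines.
        var myUrl = "https://raw.githubusercontent.com/Angular2Guy/MovieManager/master/backend/src/main/java/ch/xxx/moviemanager/usecase/service/ActorService.java";
        var importLine = "import ch.xxx.moviemanager.domain.model.entity.Actor;";
        var basePackage = "ch.xxx";
        // Everything before the base package path: ".../backend/src/main/java/"
        var prefix = myUrl.split(basePackage.replace(".", "/"))[0].trim();
        // The imported class as a path: "ch/xxx/moviemanager/domain/model/entity/Actor"
        var classPath = importLine.split("import")[1].split(";")[0].replaceAll("\\.", "/").trim();
        // The file extension of the class under test: ".java"
        var suffix = myUrl.substring(myUrl.lastIndexOf('.'));
        // -> .../src/main/java/ch/xxx/moviemanager/domain/model/entity/Actor.java
        System.out.println(prefix + classPath + suffix);
    }
}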

The method ‘generateTest(…)’ is used to create the test sources for the class under test with the AI/LLM:

public String generateTest(String url, Optional<String> testUrlOpt) {
    var start = Instant.now();
    var githubSource = this.createTestSources(url, true);
    var githubTestSource = testUrlOpt.map(testUrl -> this.createTestSources(testUrl, false))
        .orElse(new GithubSource(null, null, List.of(), List.of()));
    String contextClasses = githubSource.dependencies().stream()
        .filter(x -> this.contextWindowSize >= 16 * 1024)
        .map(myGithubSource -> myGithubSource.sourceName() + ":" + System.getProperty("line.separator")
            + myGithubSource.lines().stream()
                .collect(Collectors.joining(System.getProperty("line.separator"))))
        .collect(Collectors.joining(System.getProperty("line.separator")));
    String testExample = Optional.ofNullable(githubTestSource.sourceName())
        .map(x -> "Use this as test example class:" + System.getProperty("line.separator")
            + githubTestSource.lines().stream()
                .collect(Collectors.joining(System.getProperty("line.separator"))))
        .orElse("");
    String classToTest = githubSource.lines().stream()
        .collect(Collectors.joining(System.getProperty("line.separator")));
    LOGGER.debug(new PromptTemplate(
        this.contextWindowSize >= 16 * 1024 ? this.ollamaPrompt1 : this.ollamaPrompt,
        Map.of("classToTest", classToTest, "contextClasses", contextClasses, "testExample", testExample))
        .createMessage().getContent());
    LOGGER.info("Generation started with context window: {}", this.contextWindowSize);
    var response = chatClient.call(new PromptTemplate(
        this.contextWindowSize >= 16 * 1024 ? this.ollamaPrompt1 : this.ollamaPrompt,
        Map.of("classToTest", classToTest, "contextClasses", contextClasses, "testExample", testExample)).create());
    if ((Instant.now().getEpochSecond() - start.getEpochSecond()) >= 300) {
        LOGGER.info(response.getResult().getOutput().getContent());
    }
    LOGGER.info("Prompt tokens: " + response.getMetadata().getUsage().getPromptTokens());
    LOGGER.info("Generation tokens: " + response.getMetadata().getUsage().getGenerationTokens());
    LOGGER.info("Total tokens: " + response.getMetadata().getUsage().getTotalTokens());
    LOGGER.info("Time in seconds: {}", (Instant.now().toEpochMilli() - start.toEpochMilli()) / 1000.0);
    return response.getResult().getOutput().getContent();
}

To do that, the ‘createTestSources(…)’ method is used to create the records with the source lines. Then the string ‘contextClasses’ is created to replace the ‘{contextClasses}’ placeholder in the prompt. If the context window is smaller than 16k tokens, the string is left empty to have enough tokens for the class under test and the test example class. Then the optional ‘testExample’ string is created to replace the ‘{testExample}’ placeholder in the prompt. If no ‘testUrl’ is provided, the string is empty. Then the ‘classToTest’ string is created to replace the ‘{classToTest}’ placeholder in the prompt.

The ‘chatClient’ is called to send the prompt to the AI/LLM. The prompt is selected based on the size of the context window in the ‘contextWindowSize’ property. The ‘PromptTemplate’ replaces the placeholders with the prepared strings.

The ‘response’ is used to log the number of prompt tokens, generation tokens, and total tokens, to be able to check whether the context window boundary was honored. Then the time needed to generate the test source is logged, and the test source is returned. If the generation of the test source took more than 5 minutes, the test source is also logged, as a protection against browser timeouts.

Conclusion

The models have been tested to generate Spring controller tests and Spring service tests. The test URLs were:

http://localhost:8080/rest/code-generation/test?url=https://github.com/Angular2Guy/MovieManager/blob/master/backend/src/main/java/ch/xxx/moviemanager/adapter/controller/ActorController.java&testUrl=https://github.com/Angular2Guy/MovieManager/blob/master/backend/src/test/java/ch/xxx/moviemanager/adapter/controller/MovieControllerTest.java
http://localhost:8080/rest/code-generation/test?url=https://github.com/Angular2Guy/MovieManager/blob/master/backend/src/main/java/ch/xxx/moviemanager/usecase/service/ActorService.java&testUrl=https://github.com/Angular2Guy/MovieManager/blob/master/backend/src/test/java/ch/xxx/moviemanager/usecase/service/MovieServiceTest.java

The ‘granite-code:20b’ LLM on Ollama has a context window of 8k tokens. That is too small to provide the ‘contextClasses’ and still have enough tokens for a response. That means the LLM had only the class under test and the test example to work with.

The ‘deepseek-coder-v2:16b’ and the ‘codestral:22b’ LLMs on Ollama have context windows of more than 32k tokens. That enabled the addition of the ‘contextClasses’ to the prompt, and the models can work with chain-of-thought prompts.

Results

The Granite-Code LLM was able to generate a buggy but useful basis for a Spring service test. None of the tests worked, but the missing parts could be explained by the missing context classes. The Spring controller test was not as good: it missed too much code to be useful as a basis. The test generation took more than 10 minutes on a medium-power laptop CPU.

The Deepseek-Coder-V2 LLM was able to create a Spring service test with the majority of the tests working. That was a good basis to work with, and the missing parts were easy to fix. The Spring controller test had more bugs but was a useful basis to start from. The test generation took less than ten minutes on a medium-power laptop CPU.

The Codestral LLM was able to create a Spring service test with one failing test. That more complicated test needed some fixes. The Spring controller test also had only one failing test case, but that was because a configuration call was missing, which made the tests succeed without doing any testing. Both generated tests were a good starting point. The test generation took more than half an hour on a medium-power laptop CPU.

Opinion

The Deepseek-Coder-V2 and Codestral LLMs can help with writing tests for Spring applications. Codestral is the better model but needs significantly more processing power and memory. For productive use, both models need GPU acceleration. The LLMs are not able to create non-trivial code correctly, even with the context classes available. The help an LLM can provide is very limited, because LLMs do not understand the code. Code is just characters to an LLM, and without an understanding of language syntax, the results are not impressive. The developer has to be able to fix all the bugs in the tests and add missing parts like assertions. That means the LLM mostly saves some time typing the tests.

The experience with GitHub Copilot is similar to the Granite-Code LLM. As of September 2024, the context window is too small for good code generation, and the code completion suggestions need to be ignored too often.

Is an LLM a help -> yes.

Is the LLM a large timesaver -> no.