2024 - Unleashing The Potential of Prompt Engineering in LLM
Abstract
This paper delves into the pivotal role of prompt engineering in unleashing the
capabilities of Large Language Models (LLMs). Prompt engineering is the pro-
cess of structuring input text for LLMs and is a technique integral to optimizing
the efficacy of LLMs. This survey elucidates foundational principles of prompt
engineering, such as role-prompting, one-shot, and few-shot prompting, as well as
more advanced methodologies such as the chain-of-thought and tree-of-thoughts
prompting. The paper sheds light on how external assistance in the form of
plugins can assist in this task, and reduce machine hallucination by retrieving
external knowledge. We subsequently delineate prospective directions in prompt
engineering research, emphasizing the need for a deeper understanding of struc-
tures and the role of agents in Artificial Intelligence-Generated Content (AIGC)
tools. We discuss how to assess the efficacy of prompt methods from different
perspectives and using different methods. Finally, we gather information about
the application of prompt engineering in such fields as education and program-
ming, showing its transformative potential. This comprehensive survey aims to
serve as a friendly guide for anyone venturing through the big world of LLMs and
prompt engineering.
1 Introduction
In recent years, a significant milestone in artificial intelligence research has been
the progression of natural language processing capabilities, primarily attributed to
large language models (LLMs). Many popular models, rooted in the transformer
architecture [1], undergo training on extensive datasets derived from web-based text.
Central to their design is a self-supervised learning objective, which focuses on pre-
dicting subsequent words in incomplete sentences. These models are often referred to as Artificial
Intelligence-Generated Content (AIGC) models, and their ability to generate coherent and
contextually relevant responses is a result of this training process, where they learn to
associate words and phrases with their typical contexts.
LLMs operate by encoding the input text into a high-dimensional vector space,
where semantic relationships between words and phrases are preserved. The model
then decodes this representation to generate a response, guided by the learned sta-
tistical patterns [2]. The quality of the response can be influenced by various factors,
including the prompt provided to the model, the model’s hyperparameters, and the
diversity of the training data.
These models, such as GPT-3 [3], GPT-4 [4], along with many others (e.g., Google’s
BARD [5], Anthropic’s Claude2 [6] and Meta’s LLaMA-2 [7]), have been utilized to
revolutionize tasks ranging from information extraction to the creation of engaging
content [8]. As with other AI systems, the application of LLMs in the workplace has the
potential to automate routine tasks, such as data analysis [9] and text generation
[10], thereby freeing up time for employees to focus on more complex and rewarding
tasks [11]. Furthermore, LLMs have the potential to revolutionize the healthcare sec-
tor by assisting in the diagnosis and treatment of diseases. Indeed, by analyzing vast
amounts of medical literature, these models can provide doctors with insights into
rare conditions, suggest potential treatment pathways, and even predict patient out-
comes [12]. In the realm of education, LLMs can serve as advanced tutoring systems,
and promote the quality of teaching and learning [13]. Those AI tools can analyze a
student’s response, identify areas of improvement, and provide constructive feedback
in a coherent and formal manner.
In real applications, the prompt is the input to the model, and its engineering
can result in significant differences in output [14]. Modifying the structure (e.g., altering
length, arrangement of instances) and the content (e.g., phrasing, choice of illustra-
tions, directives) can exert a notable influence on the output generated by the model
Studies show that both the phrasing and the ordering of examples incorporated
within a prompt exert a substantial influence on the model's
behavior [15, 16].
The discipline of prompt engineering has advanced alongside LLMs. What orig-
inated as a fundamental practice of shaping prompts to direct model outputs has
matured into a structured research area, replete with its distinct methodologies and
established best practices. Prompt engineering refers to the systematic design and
optimization of input prompts to guide the responses of LLMs, ensuring accuracy, rel-
evance, and coherence in the generated output. This process is crucial in harnessing
the full potential of these models, making them more accessible and applicable across
diverse domains. Contemporary prompt engineering encompasses a spectrum of tech-
niques, ranging from foundational approaches such as role-prompting [17] to more
sophisticated methods such as “chain of thought” prompting [18]. The domain remains
dynamic, with emergent research continually unveiling novel techniques and applica-
tions in prompt engineering. The importance of prompt engineering is accentuated by
its ability to guide model responses, thereby amplifying the versatility and relevance of
LLMs in various sectors. Importantly, a well-constructed prompt can counteract chal-
lenges such as machine hallucinations, as highlighted in studies by [19] and [20]. The
influence of prompt engineering extends to numerous disciplines. For instance, it has
facilitated the creation of robust feature extractors using LLMs, thereby improving
their efficacy in tasks such as defect detection and classification [21].
In this paper, we present a comprehensive survey on the prompt engineering of
LLMs. The structure of the paper is organized as follows: Section 2 presents the foun-
dational methods of prompt engineering, showcasing various results. It encompasses
both basic and advanced techniques. Section 3 further explores advanced method-
ologies, including the use of external assistance. All the examples were generated with
a non-multimodal generative language model, the "default GPT-4" developed by
OpenAI. Section 4 discusses potential future directions in prompt engineering. Section
5 provides insights into prompt evaluation techniques, drawing comparisons between
subjective and objective assessment methods. Finally, Section 6 focuses on the broader
applications of prompt engineering across various domains.
Model input
I want to understand some cutting-edge aspects of technology.
Model output
Here are some of the key areas: ...
the given scenario, as it reduces the model’s uncertainty and guides it toward the
correct response.
For instance, as shown in Figure 2, instead of asking a vague requirement such as “I
want to understand the cutting edge of technology.”, a more precise prompt would be
“I want to understand the cutting edge of technology, specifically related to artificial
intelligence and machine learning...”.
Model input
I want to understand the cutting edge of technology, specifically related to artificial
intelligence and machine learning. Recently, artificial intelligence has become
extremely popular, especially the large language models which have amazed me.
Please analyze the possible reasons behind the rapid advancement of this
technology from at least three perspectives.
Model output
Absolutely, the rise of AI and machine learning, especially with regard to large lan-
guage models like GPT-4, can be seen as a convergence of several key factors. I’ll
analyze this from three perspectives: technological advancements, availability of data,
and socio-economic demand.
1. Technological advancements: The growth of AI and machine learning models has
been facilitated by several technological innovations. For starters, we’ve seen incredi-
ble improvements in computational power, particularly in Graphics Processing Units
(GPUs) and now even more specialized AI chips. These advances have made it ...
2. Availability of data: ...
3. Socio-economic demand: ...
2.4 Role-prompting
Role-prompting is another fundamental method in prompt engineering. It involves
giving the model a specific role to play, such as a helpful assistant or a knowledgeable
expert [32]. This method can be particularly effective in guiding the model’s responses
and ensuring that they align with the desired output. For instance, if the model is
prompted to act as a historian, it is more likely to provide a detailed and contextually
accurate response when asked about a historical event [33]. Another brief example is
shown in Figure 3.
A role prompt
You are an expert in artificial intelligence specializing in large language models...
Model output
Here are five basic methods to quickly get started with large language models:
1. Experiment with web-based interfaces:
...
Fig. 3 A brief example of role-prompting.
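To make the construction explicit, the sketch below shows how such a role prompt can be assembled programmatically before being sent to a model. It is a minimal illustration in Python: the role text mirrors Figure 3, while the question string and the idea of simply prepending the role are illustrative assumptions rather than a procedure prescribed by the paper.

# Minimal sketch of role-prompting: a role description is prepended to the
# actual request. The resulting string is what would be sent to the LLM.
role = ("You are an expert in artificial intelligence specializing in "
        "large language models.")
question = ("What are some basic methods to quickly get started with "
            "large language models?")  # illustrative question, not from the figure
role_prompt = f"{role}\n\n{question}"
print(role_prompt)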
Standard Prompt
Model input
Q: Yes or no: would a pear sink in water?
Model output
A: Yes, a pear would typically float in water, not sink. The reason for this is that
pears, like many fruits, are composed primarily of water and have a density that is
less than that of water. Objects that are less dense than water will float when placed
in water. ×

One-shot Prompt
Model input
Q: Yes or no: is it common to see frost during some college commencements?
A: The answer is yes.
Q: Yes or no: would a pear sink in water?
Model output
A: The answer is No. ✓

Fig. 4 Comparison of standard prompt and one-shot prompt.
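Programmatically, a one-shot prompt such as the one in Figure 4 is just the concatenation of a single worked example and the new question, so that the model imitates the demonstrated answer format. The short Python sketch below reproduces that pattern; the construction is illustrative and not an interface defined in the paper.

# Minimal sketch of one-shot prompting in the format of Fig. 4: one worked
# question-answer pair is placed before the new question.
exemplar = ("Q: Yes or no: is it common to see frost during some college "
            "commencements?\n"
            "A: The answer is yes.")
new_question = "Q: Yes or no: would a pear sink in water?\nA:"
one_shot_prompt = f"{exemplar}\n\n{new_question}"
print(one_shot_prompt)  # a few-shot prompt simply adds more exemplars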
3 Advanced methodologies
The foundational methods from the previous section can help us produce satisfactory
outputs. However, experiments indicate that when using LLMs for complex tasks
such as analysis or reasoning, the accuracy of the model’s outputs still has room for
improvement. In this section, we will further introduce advanced techniques in prompt
engineering to guide the model in generating more specific, accurate, and high-quality
content.
themselves new skills by chain-of-thought reasoning. In another study, [49] used CoT
prompting to train verifiers to solve math word problems, demonstrating the tech-
nique’s potential in educational applications. [50] proposed a multimodal version of
CoT, called Multimodal-CoT, to handle more complex, multimodal tasks beyond sim-
ple text-based tasks, such as visual tasks, further expanding the potential applications
of CoT.
Model input
Imagine an infinitely wide entrance, which is more likely to pass through it, a
military tank or a car?
Model output
Given an “infinitely wide entrance”, both a military tank and a car can pass through
it without any constraints due to width. Therefore, neither is more likely than the
other based on width alone. Both can pass through with equal likelihood.
However, if other factors such as ground conditions, accessibility, purpose, or autho-
rization come into play, the likelihood might change. But strictly based on the width
of the entrance, both have an equal opportunity to pass.
Model input
Imagine an infinitely wide entrance, which is more likely to pass through it, a
military tank or a car? Let’s think step by step.
Model output
Given an infinitely wide entrance, both a military tank and a car could pass through
without any issues related to the width. However, there are still several factors that
could influence which is more likely to pass through:
1. Ground surface: tanks are designed to handle rough terrains that cars may struggle
with. If the ground leading to and from the entrance is too rugged or steep for a car,
the tank would be more likely to pass through.
2. Weight restrictions: if the entrance has a load limit due to the materials used in
its construction, the car, being lighter, could be more likely to pass through. The
average car weighs around 1.5 tons, while a military tank can weigh anywhere from
60 to 70 tons.
3. Height clearance: if the entrance is infinitely wide but not infinitely tall, the height
of the vehicle matters. The average height of a car is around 1.5 meters, while tanks
can be around 2.4 meters. If the entrance is shorter than around 2.4 meters, a car
would be more likely to pass through.
4. Regulations and permissions:
...
In conclusion, without specific information about the conditions at the entrance, we
would assume that a car is more likely to pass through, simply because cars are more
common, more likely to meet potential height and weight restrictions and face fewer
regulatory barriers.
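As the two transcripts above suggest, the zero-shot chain-of-thought variant differs from the standard prompt only by an appended reasoning trigger. The Python fragment below makes that difference explicit; it is a sketch for illustration, not code from the paper.

# Zero-shot chain-of-thought prompting: the only change versus the standard
# prompt is the appended trigger phrase that elicits step-by-step reasoning.
question = ("Imagine an infinitely wide entrance, which is more likely to "
            "pass through it, a military tank or a car?")
standard_prompt = question
cot_prompt = question + " Let's think step by step."
print(standard_prompt)
print(cot_prompt)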
3.2 Self-consistency
In an assessment of InstructGPT [53] and GPT-3 [3] on a new synthetic QA
dataset called PrOntoQA (Proof and Ontology-Generated Question-Answering)
[54, 55], it was observed that although the largest model exhibited capability
in reasoning tasks, it encountered challenges in proof planning and in selecting the
appropriate proof step among multiple options, which caused accuracy uncertainties
[54]. Self-consistency in LLMs is an advanced prompting technique that aims to ensure
the model’s responses are consistent with each other [18, 56], which greatly increases
the odds of obtaining highly accurate results. The principle behind it is that if a model
is asked to answer a series of related questions, the answers should not contradict each
other.
The self-consistency method consists of three steps: first, prompt the language model
with CoT prompting; second, replace the "greedy decode" (1-best) strategy [25, 57] of CoT
prompting with sampling from the language model's decoder to generate a diverse set
of reasoning paths; and finally, marginalize out the reasoning paths and aggregate by
choosing the most consistent answer in the final answer set.
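A compact sketch of these three steps is given below. It assumes a caller-supplied llm callable that returns one sampled completion per call (e.g., decoded at a non-zero temperature) and an extract_answer helper that pulls the final answer out of a reasoning path; both are hypothetical stand-ins rather than interfaces defined in the cited works.

from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     llm: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     n_samples: int = 10) -> str:
    """Sample several chain-of-thought paths and return the most consistent answer."""
    prompt = question + "\nLet's think step by step."
    # Step 2: sample a diverse set of reasoning paths instead of greedy decoding.
    reasoning_paths = [llm(prompt) for _ in range(n_samples)]
    # Step 3: marginalize out the reasoning and keep only the final answers.
    answers = [extract_answer(path) for path in reasoning_paths]
    # The most frequent (most consistent) answer is returned.
    return Counter(answers).most_common(1)[0][0]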
It is noteworthy that self-consistency can be harmoniously integrated with most
sampling algorithms, including but not limited to, temperature sampling [39, 40], top-k
sampling [25, 58, 59], and nucleus sampling [26]. Nevertheless, such an operation may
necessitate the invocation of the model’s Application Programming Interface (API)
to fine-tune these hyperparameters. In light of this, an alternative approach is to
have the model itself produce a diverse set of candidate responses, each following a
different reasoning path, directly within a single prompt. The response demonstrating the
highest degree of consistency across the various reasoning trajectories is then more
inclined to represent the accurate solution [60].
Studies have shown that self-consistency enhances outcomes in arithmetic, com-
monsense, and symbolic reasoning tasks [2, 61]. Furthermore, in practice, self-
consistency can be combined with other techniques to further enhance the model’s
performance. For example, a study found that combining self-consistency with a
discriminator-guided multi-step reasoning approach significantly improved the model’s
reasoning capabilities [62].
Model input
Generate two key analyses related to detailed size data on military tanks and cars,
and then generate three key influencing factors regarding whether an object can
pass through an infinitely wide entrance.
Model output
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight: - Military tanks: typically, main battle tanks (like the M1
Abrams or T-90) have widths ranging from 10 to 13 feet, lengths from 25 to 35 feet
(including the main gun), and weights from 60 to 70 tons. - Cars: most consumer cars
have a width of about 6 to 8 feet, a length of 12 to 20 feet, and weigh between 1 to 3
tons. Some large SUVs and trucks can exceed these figures but are still substantially
smaller and lighter than tanks.
2. Maneuverability & design: - Military tanks: tanks are designed to traverse rough
terrains and have tracks that distribute weight and prevent the vehicle from getting
easily stuck in muddy or uneven ground. However, their sheer weight and size limit
their maneuverability in urban or congested settings. - Cars: designed for paved
roads, cars are generally more agile and can navigate tight spaces more efficiently
than tanks. Their design emphasizes aerodynamics and comfort, unlike tanks that
prioritize armor and firepower.
Three key influencing factors on whether an object can pass through an infinitely
wide entrance:
1. Height of the object: the height is critical. An infinitely wide entrance does not
give information about its height. If the object is taller than the entrance, it won’t
pass through.
2. Structural integrity: depending on the material and design of an object, squeezing or
forcing it through an entrance could compromise its structural integrity. For instance,
certain tanks have reactive armor or sensitive equipment that may be damaged if
scraped or bumped forcefully.
3. Ground clearance and maneuverability: ...
Model input
“““
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight:
- Military tanks: typically, main battle tanks (like the M1 Abrams or T-90) have
widths ranging from 10 to 13 feet, lengths from 25 to 35 feet (including the main
gun), and weights from 60 to 70 tons.
- Cars: most consumer cars have a width of about 6 to 8 feet, a length of 12 to 20
feet, and weigh between 1 to 3 tons. Some large SUVs and trucks can exceed these
figures but are still substantially smaller and lighter than tanks.
2. Maneuverability & design...
”””
Use the above information to answer the following question: Imagine an infinitely
wide entrance, which is more likely to pass through it, a military tank or a car?
Model output
Given the information provided about the dimensions, design, and the factors influ-
encing whether an object can pass through an infinitely wide entrance, we can infer
the following:
1. Height of the object: the height is not specified for either tanks or cars. How-
ever, given that tanks are built for rough terrains and often have a turret and other
equipment on top, they generally have a taller profile compared to most cars. If the
entrance had a restrictive height, cars would generally have an advantage here.
2. Structural integrity: ...
3. Ground clearance and maneuverability:
...
Conclusion: If the entrance height is not restrictive and the surface is rough, a mili-
tary tank would be more likely to pass through it without issues. On the other hand,
if the entrance is low-height or has tight squeezes that could compromise the struc-
tural integrity of larger, robust vehicles, a car might be at an advantage.
However, given no height constraints and only the factor of width (since it is “infinitely
wide”), both a military tank and a car would easily pass through it. In such a scenario,
the distinguishing factor would be the terrain of the entrance. If it is rough or uneven,
the tank has the advantage. If it is smooth, both would pass with relative ease.
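The two calls shown above can be chained programmatically: the first prompt elicits auxiliary knowledge, and the second wraps that knowledge in delimiters and appends the actual question. The sketch below mirrors this pattern under the assumption of a caller-supplied llm callable; it is an illustrative reading of the example, not an implementation from the paper.

from typing import Callable

def knowledge_augmented_answer(knowledge_request: str,
                               question: str,
                               llm: Callable[[str], str]) -> str:
    """Generate auxiliary knowledge first, then reuse it as context for the question."""
    # Stage 1: ask the model for background analyses, facts, or influencing factors.
    knowledge = llm(knowledge_request)
    # Stage 2: wrap the generated knowledge in delimiters and append the question.
    prompt = ('"""\n' + knowledge + '\n"""\n'
              "Use the above information to answer the following question: " + question)
    return llm(prompt)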
sophisticated responses from the language model. The foundational premise of this
approach is the decomposition of intricate problems into a succession of more rudi-
mentary subproblems, which are then sequentially addressed. The resolution of each
subproblem is expedited by leveraging solutions derived from antecedent subproblems.
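A minimal sketch of this decomposition-then-solve loop is shown below; it assumes a caller-supplied llm callable and a particular phrasing of the decomposition request, so it should be read as one illustrative rendering of the paradigm rather than the exact procedure of [64].

from typing import Callable

def least_to_most(problem: str, llm: Callable[[str], str]) -> str:
    """Decompose a problem into simpler subproblems and solve them in sequence."""
    # Step 1: ask the model to list simpler subproblems, one per line.
    decomposition = llm("Break the following problem into a short list of simpler "
                        "subproblems, one per line:\n" + problem)
    subproblems = [line.strip() for line in decomposition.splitlines() if line.strip()]
    # Step 2: solve each subproblem, feeding earlier answers back as context.
    context, answer = "", ""
    for sub in subproblems:
        answer = llm(f"{context}\nProblem: {problem}\nSubproblem: {sub}\nAnswer:")
        context += f"\nSubproblem: {sub}\nAnswer: {answer}"
    # The answer to the last subproblem addresses the original problem.
    return answer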
Upon rigorous experimentation in domains including symbolic manipulation, com-
positional generalization, and mathematical reasoning, findings from [64] substantiate
that the least-to-most prompting paradigm exhibits the capacity to generalize across
challenges of greater complexity than those initially presented in the prompts. They
found that LLMs seem to respond effectively to this method, demonstrating its
potential for enhancing the reasoning capabilities of these models.
might first ask the model to outline the steps required to solve the problem. Subse-
quent prompts would then delve deeper into each step, guiding the model through a
systematic problem-solving process.
[65] demonstrates that this formulation is more versatile and can handle challeng-
ing tasks where standard prompts might fall short. Another research by [66] further
emphasizes the potential of this technique in enhancing the performance of LLMs by
structuring their thought processes.
[5] introduces the “tree-of-thought prompting”, an approach that assimilates the
foundational principles of the ToT frameworks and transforms them into a stream-
lined prompting methodology. This technique enables LLMs to assess intermediate
cognitive constructs within a singular prompt. An exemplar ToT prompt is delineated
in Figure 9 [5].
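A single-prompt rendering in this spirit asks the model to simulate several experts who reason step by step and discard branches that prove wrong. The wording below is a paraphrase for illustration, not the verbatim exemplar of Figure 9.

# An illustrative single-prompt tree-of-thought template in the spirit of [5];
# the wording is paraphrased rather than quoted from Figure 9.
question = ("Imagine an infinitely wide entrance, which is more likely to "
            "pass through it, a military tank or a car?")
tot_prompt = (
    "Imagine three different experts are answering this question.\n"
    "Each expert writes down one step of their thinking and shares it with the group.\n"
    "Then all experts move on to the next step, and so on.\n"
    "If any expert realizes they are wrong at any point, they leave the discussion.\n"
    "The question is: " + question
)
print(tot_prompt)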
external knowledge into the model’s input [71, 72]. It is emerging as an AI frame-
work for retrieving facts from external sources. [73] examines the augmentation of
context retrieval through the incorporation of external information. It proposes a
sophisticated operation: the direct concatenation of pertinent information obtained
from an external source to the prompt, which is subsequently treated as foundational
knowledge for input into the expansive language model. Additionally, the paper intro-
duces auto-regressive techniques for both retrieval and decoding, facilitating a more
nuanced approach to information retrieval and fusion. This research demonstrates that
in-context retrieval-augmented language models [73], when constructed upon readily
available general-purpose retrievers, yield significant LLM enhancements across a vari-
ety of model dimensions and diverse corpora. In another study, [74] showed that
hallucinations in GPT-3 can be reduced by studying various components of architectures
such as Retrieval Augmented Generation (RAG) [75], Fusion-in-Decoder (FiD) [76],
Seq2seq [77–79], and others. [80] developed Chain-of-Verification (CoVe) to reduce
hallucinations, and notes that equipping the verification-execution step with tool use,
such as retrieval augmentation, would likely bring further gains.
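In its simplest in-context form, retrieval augmentation therefore amounts to concatenating retrieved passages to the prompt before querying the model. The sketch below illustrates that pattern; the retrieve and llm callables are caller-supplied stand-ins, not components prescribed by the cited works.

from typing import Callable, List

def retrieval_augmented_answer(question: str,
                               retrieve: Callable[[str], List[str]],
                               llm: Callable[[str], str],
                               k: int = 3) -> str:
    """Concatenate top-k retrieved passages to the prompt as foundational knowledge."""
    passages = retrieve(question)[:k]
    context = "\n\n".join(passages)
    prompt = ("Use the following retrieved information as background knowledge:\n"
              '"""\n' + context + '\n"""\n'
              "Question: " + question + "\nAnswer:")
    return llm(prompt)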
4 Prospective methodologies
Several key developments on the horizon promise to substantially advance prompt
engineering capabilities. In this section, we analyze some of the most significant
trajectories that are likely to shape the future of prompt engineering.
By anticipating where prompt engineering is headed, developments in this field can
be proactively steered toward broadly beneficial outcomes.
At the heart of most AI models, including GPT-4, are complex mechanisms
designed to understand and generate human language. The interplay of these mech-
anisms forms the “structure” of these models. Understanding this structure involves
unraveling the many layers of neural networks, the various attention mechanisms at
work, and the role of individual nodes and weights in the decision-making process of
these models [89]. Deepening our understanding of these structures could lead to sub-
stantial improvements in prompt engineering. Misunderstanding the model may also
cause a lack of reproducibility [90]. By understanding how specific components of the
model’s structure influence its outputs, we could design prompts that more effectively
exploit these components.
Furthermore, a comprehensive grasp of these structures could shed light on the
shortcomings of certain prompts and guide their enhancement. Frequently, the under-
lying causes for a prompt’s inability to yield the anticipated output are intricately
linked to the model’s architecture. For example, [16] found evidence of limitations in
previous prompt models and questioned how much these methods truly understood
the model.
Exploration of AI model architectures remains a vibrant research domain, with
numerous endeavors aimed at comprehending these sophisticated frameworks. A
notable instance is DeepMind’s “Causal Transformer” model [91], designed to explic-
itly delineate causal relationships within data. This represents a stride towards a more
profound understanding of AI model architectures, with the potential to help us design
more efficient prompts.
Furthermore, a more comprehensive grasp of AI model architectures would also
yield advancements in explainable AI. Beyond better prompt engineering, this would
also foster greater trust in AI systems and promote their integration across diverse
industries [92]. For example, while AI is transforming the financial sector, encom-
passing areas such as customer service, fraud detection, risk management, credit
assessments, and high-frequency trading, several challenges, particularly those related
to transparency, are emerging alongside these advancements [93, 94]. Another example
is medicine, where AI’s transformative potential faces similar challenges [95, 96].
In conclusion, the trajectory toward a better understanding of AI model structures
promises to bring significant advancements in prompt engineering. As we delve
deeper into these intricate systems, we should be able to craft more effective prompts,
understand the reasons behind prompt failures, and enhance the explainability of AI
systems. This path holds the potential for transforming how we interact with and
utilize AI systems, underscoring its importance in the future of prompt engineering.
they strive to optimize both efficiency and precision in task execution. A salient chal-
lenge for prompt engineering emerges: devising and instituting prompts that adeptly
steer AI agents toward self-regulation [16].
In conclusion, the introduction of agent-based paradigms heralds a novel trajectory
for the evolution of AIGC tools. This shift necessitates a reevaluation of established
practices in prompt engineering and ushers in fresh challenges associated with the
design, implementation, and refinement of prompts.
to outperform other methods, with prompts having little impact on performance; in
zero-shot settings, expert-written task-specific prompts can significantly boost perfor-
mance, with automated prompts not outperforming simple baselines; the performance
of automated prompt generation methods is inconsistent, varying across different mod-
els and task types, displaying a lack of generalization. InstructEval provides important
references for prompt engineering and demonstrates the need for more universal and
reliable evaluation paradigms to design optimal prompts.
prompting [51] with GPT-3 to craft a foundational plan including elements such
as settings, characters, and outlines. Subsequently, it adopts a recursive technique,
dynamically prompting GPT-3 to produce extended story continuations. For another
example, Detailed Outline Control (DOC) [121] aims at preserving plot coherence
across extensive texts generated with the assistance of GPT-3. Unlike Re3, DOC
employs a detailed outliner and detailed controller for implementation. The detailed
outliner initially dissects the overarching outline into subsections through a breadth-
first method, where candidate generations for these subsections are generated, filtered,
and subsequently ranked. This process is similar to the method of chain-of-thought
(in Section 3.1). Throughout this generation process, an OPT-based Future Dis-
criminators for Generation (FUDGE) [122] detailed controller plays a crucial role in
maintaining relevance.
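As a rough illustration of the recursive-reprompting idea, a plan can be generated once and then repeatedly fed back together with the story so far when requesting the next continuation. The sketch below is a drastically simplified reading in the spirit of Re3, with a caller-supplied llm callable; it omits the filtering, ranking, and control components described above.

from typing import Callable

def recursive_story(premise: str, llm: Callable[[str], str], n_rounds: int = 5) -> str:
    """Generate a plan once, then repeatedly reprompt for story continuations."""
    plan = llm("Write a brief plan (setting, characters, outline) for a story about: "
               + premise)
    story = ""
    for _ in range(n_rounds):
        continuation = llm("Plan:\n" + plan + "\n\nStory so far:\n" + story +
                           "\n\nContinue the story, staying consistent with the plan.")
        story += "\n" + continuation
    return story.strip()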
data generation. In addition, [131] is designed for the generation of supplementary syn-
thetic data for classification tasks. GPT-3 is utilized in conjunction with a prompt that
includes real examples from an existing dataset, along with a task specification. The
goal is to jointly create synthetic examples and pseudo-labels using this combination
of inputs.
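A prompt of this kind can be assembled by combining a task specification with a few real labeled examples, so that the model continues the pattern with new synthetic examples and pseudo-labels. The sketch below shows one possible format in the spirit of [131]; the task, examples, and field names are illustrative placeholders, not the template used in that work.

# Building a synthetic-data prompt from a task specification plus real labeled
# examples; the model is expected to continue with a new example and its label.
task_spec = ("Each example is a movie review followed by its sentiment "
             "(positive or negative).")
real_examples = [
    ("A moving, beautifully shot film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]
parts = [task_spec]
for text, label in real_examples:
    parts.append(f"Review: {text}\nSentiment: {label}")
parts.append("Review:")  # left open so the model generates a synthetic review and label
synthetic_data_prompt = "\n\n".join(parts)
print(synthetic_data_prompt)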
7 Conclusion
In this paper, we present a comprehensive overview of prompt engineering techniques
and their instrumental role in refining Large Language Models (LLMs). We detail
both foundational and advanced methodologies in prompt engineering, illustrating
their efficacy in directing LLMs toward targeted outputs. We also analyze retrieval
augmentation and plugins, which can further augment prompt engineering. We discuss
broader applications of prompt engineering, highlighting its potential in sectors such
as education and programming. We finally cast a forward-looking gaze on the future
avenues of prompt engineering, underscoring the need for a deeper understanding of
LLM architectures and the significance of agent-based paradigms. In summary, prompt
engineering has emerged as a critical technique for guiding and optimizing LLMs. As
prompt engineering becomes increasingly ubiquitous, we hope that this paper can lay the
groundwork for further research.
8 Acknowledgement
This work was funded by the Natural Science Foundation of China (12271047);
Guangdong Provincial Key Laboratory of Interdisciplinary Research and Applica-
tion for Data Science, BNU-HKBU United International College (2022B1212010006);
UIC research grant (R0400001-22; UICR0400008-21; UICR0700041-22; R72021114);
Guangdong College Enhancement and Innovation Program (2021ZDZX1046).
References
[1] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al.
Attention is all you need. In: Proceedings of the 31st International Conference
on Neural Information Processing Systems. NIPS’17; 2017. p. 6000–6010.
[3] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language
models are few-shot learners. In: Proceedings of the 34th International
Conference on Neural Information Processing Systems. NIPS'20; 2020.
[5] Hulbert D. Tree of knowledge: ToK aka Tree of Knowledge dataset for Large
Language Models (LLM). Accessed: 2023-08-15. Available from: https://github.com/
dave1010/tree-of-thought-prompting.
[9] Cheng L, Li X, Bing L. Is GPT-4 a good data analyst?; 2023. ArXiv:2305.15038.
[13] Baidoo-Anu D, Ansah LO. Education in the era of generative artificial intel-
ligence (AI): understanding the potential benefits of ChatGPT in promoting
teaching and learning. Journal of AI. 2023;7(1):52–62.
[21] Yong G, Jeon K, Gil D, Lee G. Prompt engineering for zero-shot and few-
shot defect detection and classification using a visual-language pretrained model.
Computer-Aided Civil and Infrastructure Engineering. 2022;38(11):1536–1554.
[22] Christiano PF, Leike J, Brown T, Martic M, Legg S, Amodei D. Deep rein-
forcement learning from human preferences. Advances in neural information
processing systems. 2017;30.
https://openai.com/research/language-unsupervised.
[26] Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text
degeneration. In: International Conference on Learning Representations; 2020. .
[27] Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J. Neural text generation
with unlikelihood training; 2019. ArXiv:1908.04319.
[29] Song Y, Li J. Joint learning embeddings for Chinese words and
their components via ladder structured networks. In: Proceedings of the Twenty-
Seventh International Joint Conference on Artificial Intelligence (IJCAI-18);
2018. p. 4375–4381.
[33] Buren DV. Guided scenarios with simulated expert personae: a remarkable
strategy to perform cognitive work; 2023. ArXiv:2306.03104.
[38] Liu J, Gardner M, Cohen SB, Lapata M. Multi-step inference for reasoning over
paragraphs; 2020. ArXiv:2004.02995.
[39] Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann
machines. Cognitive Science. 1985;9(1):147–169.
94–104.
[47] Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Large language
models perform diagnostic reasoning. In: Eleventh International Conference on
Learning Representations; 2022. .
[50] Huang S, Dong L, Wang W, Hao Y, Singhal S, Ma S, et al. Language is not all
you need: aligning perception with language models; 2023. ArXiv:2302.14045.
[56] Wang X, Wei J, Schuurmans D, Le QV, Chi EH, Narang S, et al. Self-
consistency improves chain of thought reasoning in language models. In:
Eleventh International Conference on Learning Representations; 2023. .
[57] Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, et al. PaLM:
scaling language modeling with pathways; 2022. ArXiv:2204.02311.
[58] Fan A, Lewis M, Dauphin Y. Hierarchical neural story generation. In: Pro-
ceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers); 2018. p. 889–898.
[61] Shum K, Diao S, Zhang T. Automatic prompt augmentation and selection with
chain-of-thought from labeled data; 2023. ArXiv:2302.12822.
[65] Yao S, Yu D, Zhao J, Shafran I, Griffiths TL, Cao Y, et al. Tree of thoughts:
deliberate problem solving with large language models; 2023. ArXiv:2305.10601.
[72] Jiang Z, Xu FF, Gao L, Sun Z, Liu Q, Dwivedi-Yu J, et al. Active retrieval
augmented generation; 2023. ArXiv:2305.06983.
[75] Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-
augmented generation for knowledge-intensive nlp tasks. Advances in Neural
Information Processing Systems. 2020;33:9459–9474.
[76] Izacard G, Grave E. Leveraging passage retrieval with generative models for
open domain question answering; 2020. ArXiv:2007.01282.
[82] Bisson S. Microsoft Build 2023: Microsoft extends its copilots with open stan-
dard plugins. Accessed: 2023-05-25. Available from: https://www.techrepublic.com/
article/microsoft-extends-copilot-with-open-standard-plugins/.
[85] whatplugin.ai. Prompt enhancer & ChatGPT plugins for AI development tools
like prompt enhancer. Accessed: 2023-09-14. Available from: https://www.whatplugin.
ai/plugins/prompt-enhancer.
[87] ChatGPT for Search Engines. Prompt Perfect plugin for ChatGPT. Accessed: 2023-10-
15. Available from: https://chatonai.org/prompt-perfect-chatgpt-plugin.
[91] Melnychuk V, Frauen D, Feuerriegel S. Causal transformer for estimating coun-
terfactual outcomes. In: International Conference on Machine Learning; 2022.
p. 15293–15329.
[95] Amann J, Blasimme A, Vayena E, Frey D, Madai VI. Explainability for arti-
ficial intelligence in healthcare: a multidisciplinary perspective. BMC Medical
Informatics and Decision Making. 2020;20(1):1–9.
[96] Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nature
Medicine. 2022;28(1):31–38.
[97] Öztürk D. What does artificial intelligence mean for organizations? A system-
atic review of organization studies research and a way forward. The Impact of
Artificial Intelligence on Governance, Economics and Finance, Volume I. 2021;p.
265–289.
[98] Seeamber R, Badea C. If our aim is to build morality into an artificial agent,
how might we begin to go about doing so? IEEE Intelligent Systems. 2023;.
[99] Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics; 2002. p. 311–318.
[103] Sai AB, Mohankumar AK, Khapra MM. A survey of evaluation metrics used
for NLG systems. ACM Computing Surveys (CSUR). 2022;55(2):1–39.
[107] Jain N, Saifullah K, Wen Y, Kirchenbauer J, Shu M, Saha A, et al. Bring
your own data! Self-supervised evaluation for large language models; 2023.
ArXiv:2306.13651.
[109] Lin YT, Chen YN. LLM-eval: unified multi-dimensional automatic evaluation for
open-domain conversations with large language models; 2023. ArXiv:2305.13711.
[111] Dehghani M, Tay Y, Gritsenko AA, Zhao Z, Houlsby N, Diaz F, et al. The
benchmark lottery; 2021. ArXiv:2107.07002.
[112] Deng M, Wang J, Hsieh CP, Wang Y, Guo H, Shu T, et al. RLPrompt: optimiz-
ing discrete text prompts with reinforcement learning. In: Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing; 2022.
p. 3369–3391.
[113] Zhou Y, Muresanu AI, Han Z, Paster K, Pitis S, Chan H, et al. Large language
models are human-level prompt engineers. In: Eleventh International Conference
on Learning Representations; 2022. .
[116] Xie Q, Dai Z, Hovy E, Luong MT, Le QV. Unsupervised data augmentation
for consistency training. In: Advances in neural information processing systems.
vol. 33; 2020. p. 6256–6268.
[117] Ariely M, Nazaretsky T, Alexandron G. Machine learning and Hebrew NLP for
automated assessment of open-ended questions in biology. International Journal
of Artificial Intelligence in Education. 2023;33(1):1–34.
[118] Nilsson F, Tuvstedt J. GPT-4 as an automatic grader: the accuracy of grades set
by GPT-4 on introductory programming assignments [Bachelor Thesis]. KTH
Royal Institute of Technology; 2023.
[120] Yang K, Peng N, Tian Y, Klein D. Re3: generating longer stories with recursive
reprompting and revision; 2022. ArXiv:2210.06774.
[121] Yang K, Klein D, Peng N, Tian Y. DOC: improving long story coherence with
detailed outline control; 2022. ArXiv:2212.10077.
[122] Yang K, Klein D. FUDGE: controlled text generation with future discriminators;
2021. ArXiv:2104.05218.
[123] Elgohary A, Hosseini S, Awadallah AH. Speak to your parser: interactive text-
to-SQL with natural language feedback. In: Annual Meeting of the Association
for Computational Linguistics; 2020. .
[128] Roy S, Roth D. Solving general arithmetic word problems. In: Proceedings
of the 2015 Conference on Empirical Methods in Natural Language Processing;
2015. p. 1743–1752.
[129] Li Y, Lin Z, Zhang S, Fu Q, Chen B, Lou JG, et al. Making language models
better reasoners with step-aware Verifier. In: Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). Toronto, Canada: Association for Computational Linguistics; 2023. p.
5315–5333. Available from: https://aclanthology.org/2023.acl-long.291.
[130] Ding B, Qin C, Liu L, Bing L, Joty S, Li B. Is GPT-3 a good data annotator?;
2022. ArXiv:2212.10450.
[131] Yoo KM, Park D, Kang J, Lee SW, Park W. GPT3Mix: leveraging large-scale
language models for text augmentation. In: Findings of the Association for
Computational Linguistics: EMNLP 2021; 2021. p. 2225–2239.
Chain-of-thought prompting enhances reasoning in LLMs by encouraging step-by-step verbalized reasoning processes, which is particularly effective for tasks such as solving word-based math problems. This approach provides a structured method for the model to follow logical reasoning steps, improving its ability to generate accurate solutions. By integrating strategies like self-talk and various prompting techniques, researchers have shown notable improvements in outcome accuracy and problem-solving efficiency.
Prompt engineering techniques enhance the applicability of LLMs in real-world scenarios by optimizing input prompts to guide model responses accurately and coherently. These techniques include foundational role-prompting and advanced methods like 'chain of thought' prompting, which improve tasks such as feature extraction, defect detection, and classification. By crafting precise prompts, one can counteract issues like machine hallucinations, thus expanding the versatility and relevance of LLMs across different sectors. Additionally, prompt engineering facilitates personalized learning and automated grading in education, as well as precise content creation, which includes story generation and maintaining coherence in extensive texts.
Prompt engineering is crucial in mitigating machine hallucinations in large language models by designing input prompts that steer the model towards accurate and factual responses. Studies have shown that the structure and content of prompts, including the phrasing and sequence of examples, significantly influence the model's behavior. Well-constructed prompts can guide the model to stay on topic, ensuring the generated content is coherent and factual, reducing the risk of hallucinations. This underscores prompt engineering's importance in leveraging LLMs effectively across various domains.
Recent advancements in prompt engineering enable the creation of personalized assessments in education by providing tailored prompts that adapt to an individual's learning pace and style. This approach allows for the development of more individual-centric teaching models where AI tools can generate custom content and feedback, enhancing the learning experience. By using sophisticated prompts, LLMs offer preliminary assessments, reducing educators' workload and providing instant feedback to students, fostering inclusivity and personalized learning.
Advancements in prompt engineering, such as the Recursive Reprompting and Revision framework (Re3) and Detailed Outline Control (DOC), have significantly enhanced LLMs' ability to assist in creative content generation like storytelling. Re3 employs zero-shot prompting with recursive techniques to dynamically extend storylines, while DOC uses detailed outlining to maintain coherence in extended narratives. These methods leverage prompt engineering to direct model outputs in crafting coherent and engaging creative pieces. The use of these sophisticated prompts helps in managing narrative structures and character development, thereby improving storytelling quality.
Automated prompt generation methods face limitations such as inconsistent performance across different models and task types, highlighting their lack of generalization compared to manually crafted prompts. Automated methods may not consistently outperform simple baselines and often struggle with specificity in zero-shot settings. Conversely, manually crafted prompts, particularly expert-written, offer task-specific precision and context that support better model performance, especially in complex or high-precision applications. This inconsistency in automated methods underscores the need for further research and development in prompt engineering to enhance their reliability and efficacy.
LLMs possess the potential to revolutionize healthcare by providing doctors with insights into rare conditions, suggesting treatment pathways, and predicting patient outcomes through analysis of extensive medical literature. However, challenges include ensuring the model's reliability and accuracy, mitigating hallucinations, and maintaining patient data confidentiality. Prompt engineering plays a role in enhancing model precision and addressing potential errors in generated insights, yet the complex and high-stakes nature of medical decisions presents significant hurdles for widespread LLM adoption.
Prompt engineering has evolved from basic practices of prompt shaping to a structured discipline with its own methodologies, focusing on the systematic design and optimization of prompts to guide LLM responses. This evolution has made it possible to address complex challenges, such as hallucinations, and enhance model versatility across industries. The continuous advancement in prompt engineering techniques, including role-prompting and chain-of-thought prompting, holds significant implications for future applications by making models more reliable, context-aware, and suitable for diverse domain-specific tasks. Its growth suggests a future of more refined and effective AI tools tailored to specific industry needs.
LLMs can enhance computer programming tasks using prompt engineering techniques, such as self-debugging prompting, which includes modules for feedback, unit-testing, and code explanation. This approach allows models like text-to-SQL to iteratively refine solutions and verify outputs. Additionally, the Repo-Level Prompt Generator (RLPG) retrieves repository context to dynamically construct effective prompts for code auto-completion tasks. By using these techniques, prompt engineering aids in improving code accuracy and efficiency, fostering better integration of LLMs in programming environments.
In few-shot settings, generic task-agnostic prompts tend to outperform other methods, with prompts having minimal impact on performance. However, in zero-shot settings, expert-written task-specific prompts significantly boost performance as they provide the necessary specificity and context that automated or generic prompts lack. The effectiveness of automated prompt generation methods varies widely across different models and tasks, indicating a lack of generalization. This suggests that expert-crafted prompts are crucial in applications requiring high precision, especially in zero-shot scenarios.