Marie Garnier Patrick Saint-Dizier
Improving the Use of English in Requirements
Analysis, results, and recommendations
For most international industries, English is the main language of communication in technical documents. Among these documents, requirements are specifically designed to be easy to read and as efficient and unambiguous as possible for their users and readers. They must leave little room for personal interpretation. Non-native speakers of English often find themselves having to write requirements without extensive training in the use of English for this task, which may result in lexical, grammatical and stylistic errors. Controlled languages are usually presented as a way to work around this difficulty, but they fail to address the specific stumbling blocks of non-native speakers, and have shortcomings of their own. In this article, we present an analysis of the errors found in a corpus of requirements written in English by French speakers, and we attempt to highlight the most efficient ways to help requirements engineers limit the number of language errors in their work. The results are also relevant to requirements written in English by speakers of other languages.
1. Introduction
As a text genre, requirements have to follow sets of rules in form and content. With the recent emphasis on the importance of requirement quality and proper training for requirements engineering, discussions have focused on higher order problems, such as gathering information for requirements, ensuring coherence in long documents or selecting criteria for requirement validation.
The role of linguistics and natural language processing (NLP) in requirements engineering and management has gained importance over the last few years. In particular, the International Requirements Engineering Conference regularly includes articles on this topic, as does the present magazine. The article "Readable requirements are not a matter of course – or are they?" (Rabeler, RE Magazine, issue 2014(4)) develops the complex notion of requirement readability, which shares connections with the present article. In addition, the two-part article "How requirements engineers can benefit from applying the NLP communication techniques" (Thomas and Georgieva, RE Magazine issues 2016(1, 2)) develops the notion of neuro-NLP and discourse notions such as reframing or generalizations. Finally, the present article is a continuation of research on using NLP to improve quality in requirements, presented in the article "LELIE – An intelligent assistant for improving requirement authoring" (Saint-Dizier and Kang, RE Magazine issue 2015(2)) [14].
In this article, we propose to tackle requirement quality from another angle, that of the quality of their language and grammar. This aspect of requirements is just as crucial as questions of content in avoiding approximations and misunderstandings. Specifically, we focus on the use of English in requirements written by French native speakers.
The research presented here is the result of a project funded by the IREB Academy Program. The objective of the research was to gather data on the language errors produced by French native speakers writing requirements in English. Errors found in a corpus of requirements for two industrial domains were thoroughly analyzed and categorized in order to identify common error types in this specific type of writing, and find ways to improve language quality and readability in requirements.
Our project stems from the following initial observations:
- A preliminary analysis of non-native speakers' output shows a large number of errors, making requirements sometimes obscure or prone to interpretation errors.
- Authoring norms exist, but they don't address the proper use of English grammar and lexicon in requirements for non-native speakers, nor do they issue warnings about the loss of intelligibility that may occur as a consequence of lexical and grammatical errors.
- As a result, requirements may include errors that decrease intelligibility and readability. This situation increases the risk of ambiguity and misinterpretation, leading to problems of misconception and lack of productivity and efficiency.
- There is a large pool of research on second language (L2) learners'/users' errors, especially for English, as well as research on L2 corpora (e.g. Granger et al., 2009) [5]. However, since requirements form a specialized linguistic genre, we make the hypothesis that the task of writing requirements in English is associated with different types of writing behaviors and language errors, warranting a specific treatment.
- Editing and correcting errors manually in requirements is a very time-consuming task, especially when done collegially in meetings. Providing warnings and corrections during the initial writing process is a way to save time in the editing stage, and enable the editors to focus on other aspects of the quality of requirements (Saint-Dizier, 2015) [13].
We have chosen to focus specifically on requirements written by French native speakers. Research in second language acquisition has shown the influence of a speaker's first language on their use of a second language (e.g. Jarvis and Pavlenko, 2007) [7], a phenomenon called transfer, or cross-linguistic influence. Language transfer plays a major role in error production, and gives precious indications as to the requirement engineer's intended meaning and possible remediation.
2. Research methodology
Research and analysis approach
Our research relies on the manual analysis of corpora of requirements written in English by French native speakers. We use the methodology of error analysis, a research method initially developed and used in the domain of Second Language Acquisition (Corder, 1981) [2]. The main steps of error analysis include the identification, classification, and interpretation of errors.
The analysis is conducted by a single trained linguist specialized in English grammar with a background in research on linguistics-based automatic grammar checking for English. Due to the complexity of the task, the different steps are performed manually. Requirements are read and screened for errors, which are then tagged.
Overview of corpora
We analyzed 772 requirements extracted from three different sources in two technical domains. They are hereafter referred to as Corpus 1, Corpus 2 (both in aeronautics but coming from two different companies) and Corpus 3 (telecommunications).
Our corpora are composed only of requirements: technical documents usually contain textual parts that are of no interest for this research (e.g. introductions, summaries, definitions, contexts, diagrams, etc.), and which were therefore not included in the corpora. As a result, we present the numerical data in terms of number of requirements rather than number of words. The number of words in each requirement varies greatly, from under 10 words to over 100 words.
We checked that these texts were indeed produced by French native speakers. It is extremely difficult to assess the actual level of English proficiency of the authors, as the corpora were compiled a posteriori: we do not have access to detailed information about the authors and have no opportunity to test them. However, we estimate that the authors have a B2, or at least a B1, level (respectively upper and lower intermediate levels in the Common European Framework of Reference for Languages issued by the Council of Europe) [3], partly because the positions the authors hold usually require a B2 level. The names of the companies that provided these documents are kept anonymous at their request.
3. Requirement quality with respect to controlled language norms
Most companies ask that a given controlled natural language (or CNL) be used in their technical documentation. Resorting to a controlled language may be seen as a way to bypass the question of language quality in requirements, since such languages constrain and simplify the grammar and lexicon that can be used in these documents. As we explain below, it is unfortunately not a completely effective solution.
After our initial observation that errors appear in requirements that are supposed to follow the norms of a controlled language, we ran a preliminary investigation into the compliance of the requirements in our corpus with standard guidelines and norms (i.e. INCOSE, IREB, general simplified natural language; Kuhn, 2014 [8]). In parallel to our main investigation on errors in requirements, we also tried to identify the main ways in which the requirements in our corpus fail to follow the recommendations of the controlled language used in the industry.
Here are some examples of the main deviations from controlled language norms found in our corpus (for a more detailed overview, see Saint-Dizier and Kang, 2015 [14]); a sketch of how some of these norms could be checked automatically follows the list:
- Extensive use of the passive voice: the use of this form is usually not recommended.
- Extensive use of fuzzy terms (e.g. particular conditions, abnormal parameters, equivalent equipment, wherever possible, etc.): up to 4 fuzzy terms have been observed in a single requirement.
- Long sentences (more than 30 words): whereas the norm usually states that sentences should not exceed 20 words, it is not unusual to find longer sentences in requirements. These long sentences often contain several subordinate clauses or complex coordination, which decrease intelligibility even further.
- Complex morphological forms: norms ban the use of the modal will to refer to the future (e.g. ventilation will be provided), as well as other complex constructions involving auxiliaries.
- Use of negation: while negation cannot be avoided in all circumstances, a number of cases should be rewritten (e.g. the solenoid shall be not activated as long as the input voltage is lower than 5mV – in this case, the sentence also includes a language error).
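As an illustration of what automatic support for these norms could look like, here is a minimal heuristic sketch in Python. The fuzzy-term list, the word-count threshold and the regular expressions are illustrative assumptions, not an actual company norm or the rule set of an existing checker.

```python
import re

# Illustrative, non-exhaustive lists: a real checker would use the norms and
# term lists of the company or controlled language in question.
FUZZY_TERMS = ["particular conditions", "abnormal parameters",
               "equivalent equipment", "wherever possible"]
MAX_WORDS = 20  # the length limit cited above

def check_cnl_norms(requirement: str) -> list[str]:
    """Return heuristic warnings for one requirement sentence."""
    warnings = []
    words = requirement.split()
    if len(words) > MAX_WORDS:
        warnings.append(f"{len(words)} words: exceeds the {MAX_WORDS}-word limit")
    for term in FUZZY_TERMS:
        if term in requirement.lower():
            warnings.append(f"fuzzy term: '{term}'")
    if re.search(r"\bwill\b", requirement, re.IGNORECASE):
        warnings.append("modal 'will' used to refer to the future")
    if re.search(r"\b(is|are|was|were|be|been|being)\s+\w+(ed|en)\b",
                 requirement, re.IGNORECASE):
        warnings.append("possible passive voice")
    if re.search(r"\b(not|no|never)\b", requirement, re.IGNORECASE):
        warnings.append("negation: consider rephrasing positively")
    return warnings

print(check_cnl_norms("Ventilation will be provided under particular conditions."))
# e.g. ["fuzzy term: 'particular conditions'",
#       "modal 'will' used to refer to the future", "possible passive voice"]
```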
4. Definition and classification of errors
Identifying errors found in requirements
At its most basic, an error is defined as "an unsuccessful bit of language" (James, 1998) [6]. The most common criterion for declaring a segment of language "unsuccessful" is grammaticality, that is to say whether or not the segment follows the rules of grammar. Acceptability is another criterion that focuses on whether or not the segment might be produced by a native speaker in an appropriate context (Lyons, 1968) [10].
The concepts of "competence" and "performance" are also important in the definition of errors. Competence errors are attributable to a lack of knowledge in the language, while performance errors are due to external factors, such as lack of attention, stress or fatigue (Corder, 1981) [2]. However, researchers have highlighted the fact that even though this distinction is theoretically relevant, it is practically impossible to distinguish competence errors from performance errors (Thouësny, 2011) [15].
We have adapted these commonly used criteria to the objectives of our project as well as to the nature of the documents in our corpus to define the types of segments we identify as errors in requirements:
- Grammaticality: segments that don't follow morpho-syntactic rules (e.g. plural agreement, modifier placement, stacks of Nouns), lexico-syntactic rules (e.g. choice of preposition after a verb), or other basic grammar rules are identified as errors.
- Acceptability: requirements need to be written as clearly and as intelligibly as possible so as to eliminate ambiguities and confusions. For this reason, we stretch the notion of acceptability to include clarity and intelligibility. As a result, segments that introduce ambiguities, lack clarity, or require an effort from the user to understand the intended meaning are identified as errors.
Classifying errors in our project
Error categories usually rely on a set of criteria, sometimes used in combinations of two or three but very often used independently:
- The linguistic domain of the error (e.g. morphology, syntax, spelling, etc.);
- The part of speech bearing the error or that needs to be modified to correct the error;
- The linguistic system linked with the error (e.g. agreement, verb complementation, etc.);
- The description of surface phenomena (e.g. word omission, extra word, wrong word, word order, etc.).
Consider the following example:
The system shall includes a locking device.
In this sentence, the verb include should not carry the ending –s since it follows a modal auxiliary (modals require the use of an uninflected verb form, or verb base, after them). According to the above criteria, this error can be described as follows: linguistic domain: morpho-syntax; part of speech: verb; linguistic system: modality (form of the verb after a modal); surface phenomenon: wrong word form.
We designed our classification system in order to obtain precise and comparable data about errors found in requirements. Previous research has posited that the type of classification needed to yield comparable data includes the rank of the linguistic unit to be taken into account for the error to become apparent (i.e. the error domain, Lennon, 1991) [9], usually Noun Phrase, Verb Phrase, Clause, etc., as well as a number of sublevels of classification giving more detailed information, such as "Preposition selection", "Placement of modifiers", etc. (Garnier, 2014) [4]. We also document spelling errors.
For the error given above (The system shall includes a locking device), the form of the verb only appears as wrong if we take into account the presence of the modal auxiliary before it, as modals require an uninflected verb form. As a result, the error domain is Verb Phrase. The second level of classification is Modality, since the error is linked to the use of a modal.
In addition, we introduce a distinction between what we call "central" and "marginal" categories. Central categories fit the description we gave above, while marginal categories allow for the researcher's own margin of error: since we are not specialists in the technical domains represented in the corpus, and don't have access to a list of expressions allowed in the companies the requirements come from, we use marginal categories to document segments that appear to us as errors, but may not actually be perceived as such by the requirements engineer and the readership.
Marginal categories include:
- The use of words that are either neologisms or don't normally have the meaning that seems to be intended in the requirement, but which could be specific to the field and thus acceptable in the context of requirements;
- Expressions that are deemed ungrammatical or unacceptable by the researcher, but which might be standard practice in the field for which the requirements are written.
Our system thus includes 2 marginal categories and 27 central categories. Errors in the central categories are distributed in 5 error domains and 1 Other category. Most domains are divided into sublevel categories. As explained above, some sublevel categories require an extra level of detail, especially when the errors represent different surface phenomena (e.g. Preposition selection vs. Missing preposition).
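To make the classification scheme concrete, the following sketch shows how one annotated error could be recorded. The field names, category labels and the requirement identifier are illustrative assumptions rather than the actual annotation schema used in the project.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ErrorAnnotation:
    requirement_id: str          # hypothetical identifier
    segment: str                 # the erroneous span as it appears in the text
    domain: str                  # e.g. "Noun Phrase", "Verb Phrase", "Other"
    sublevel: str                # e.g. "Modality", "Modification", "Determination"
    surface_phenomenon: Optional[str] = None  # e.g. "Wrong word form", "Missing preposition"
    marginal: bool = False       # True for the two "marginal" categories
    suggested_fix: Optional[str] = None

# The example discussed above, encoded with this scheme:
example = ErrorAnnotation(
    requirement_id="REQ-042",
    segment="shall includes",
    domain="Verb Phrase",
    sublevel="Modality",
    surface_phenomenon="Wrong word form",
    suggested_fix="shall include",
)
```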
5. Results and discussion
Presentation of results
Overall, we have found 279 errors in 188 requirements, in a corpus of 772 requirements. This means that 1 out of 4 requirements contained at least one error. Nearly half of these contained more than one error, with a small proportion of them including up to 4 errors.
Surprisingly, Corpus 1 and Corpus 3 show a similar proportion of errors, with 18% of the requirements in these corpora containing at least one error, while Corpus 2 has nearly double that proportion, with 33% of requirements containing at least one error. In all three corpora, the proportion of errors is far from negligible.
Diagram 2 shows the proportion of errors coming from each sub-corpus in our corpus of errors.
Errors are most often found in the domain of Noun Phrase, which accounts for about 46% of errors, and the marginal categories, with about 26% of errors. In each of them, one sublevel category holds the majority of errors: most errors in the Noun Phrase category are linked to modification (and mostly to modifier stacking, as we will see below), with 25% of errors in total, and most errors in the marginal categories are linked to the use of non-standard expressions, with about 15% of errors in total (which, as was stated above, may not be considered to be errors in the relevant technical domain).
Out of the 29 final categories, marginal ones included, only 6 account for more than 5% of errors each, and 20 of them account for less than 3% each. However, these 20 small categories together gather 37% of all errors. This is evidence of the wide diversity of errors found in requirements, and it highlights the fact that more than a third of errors prove difficult to prevent, since each category stems from a different grammatical or lexical problem.
The sublevel category we have identified as "modifier stacking" comprises segments in which a noun phrase is composed of a head noun and a string of modifiers to its left, sometimes with their own embedded modifiers. Here are a few examples of this error type:
- EX1. the following probable average operational duty cycle of the X
- EX2. The system shall include a locking in full closed position device.
- EX3. the reference computed ventilation flow (flight leg computed minimum reference flow i.e. X).
- EX4. The maximum engine casing temperature shall be at 670°C and the inner fixed structure thermal blanket temperature shall be 350°C
The use of several noun modifiers in an NP is becoming increasingly common in English, especially in technical and journalistic English (Pastor-Gomez, 2011) [11]. They are favored in these two registers because they eliminate the need for prepositions and some determiners. Preposition selection and determiner selection are two of the main difficulties non-native users face when writing in English, so we make the hypothesis that noun modifiers appeal to these writers as a way to limit the number of errors they produce. However, by eliminating prepositions and sometimes plurals, such structures rely on implicit information that needs to be reconstructed by the reader. As a result, they may lead to longer reading times and difficulties in interpreting the requirement (Biber et al., 1999) [1]. Moreover, these errors overlap with controlled language recommendations on the use of noun complements and heavy noun phrases.
The category of "non-standard expressions", found in the marginal error categories, includes segments such as the following ones:
- EX5. The tenderer shall indicate if Open source codes are embedded in this layer of the solution and case yes, the tenderer to confirm if Open source codes are compliant with Open source community licensing.
- EX6. In case one analog acquisition is detected failed on Safety channel, corresponding X output label shall be sent with Failure Warning validity status.
The fact that we included such segments in the marginal categories means that even though their form may seem ungrammatical or unacceptable in general English, we recognize that they might be standard in the language of requirements. Having no way to ascertain the acceptability of their use, we chose to document them. The first requirement is an example of a structure found only in Corpus 3, and the second requirement represents a structure found only in Corpus 2. We didn't find any non-standard expressions in Corpus 1. This imbalance indicates that the use of non-standard expressions is not necessarily expected and/or accepted in all companies, and may be domain- or even company-dependent.
The second requirement is actually one example of a type of structure that takes several forms in Corpus 2. All instances of this structure contain a form of ellipsis; here are a few additional examples:
- EX7. On ground, when PACK is selected OFF, corresponding RARV shall stay in position
- EX8. Both FCVs are selected open
- EX9. During PACK starting sequence the FCV is not commanded fully open
- EX10. In the event the RARV is failed stuck in position
From the point of view of surface syntax, these segments take different forms (e.g. a past participle followed by a participial adjective, by a preposition, by a single adjective, or by an adjective phrase composed of a head adjective and a modifying adverb or a PP complement). From a semantic point of view, however, they are built on the same model, which is close to that of verbal expressions such as to turn on or to switch off: the second term or phrase indicates the position or situation of a "mobile" element, such as a switch (e.g. OFF, ON, open, closed, failed), while the first one either specifies the action leading to that situation, or the observation of that situation (e.g. selected, detected). It is conceivable that this type of phrasing is accepted and even expected in the companies for which the requirements were written. This type of segment would therefore be an example of an ungrammatical but acceptable phrasing, and it can be seen as an efficient way to avoid the use of some prepositions.
This observation initially prompted us to exclude them from the error corpus. However, we found an alternative phrasing (selected to OFF), suggesting that the practice may not be as stable or as widely used as we initially thought. Furthermore, the ellipsis of prepositions and other words, which may be seen as increasing concision and simplicity, also creates a gap that must be filled by the reader, and may lead to ambiguities if the use of such structures is not the same in all instances.
Specificity of the genre of requirements: comparison with other error corpora
In order to find out whether the errors found in requirements were similar to those found in other non-native output, we compared our error corpus from requirements with the results of research on errors in English learner productions and research papers written by French native speakers writing in English (Garnier, 2014) [4].
First, we looked at the distribution of errors according to the main categories, and more specifically those of Noun Phrase, Verb Phrase and Sentence and Clause. Since NPs and VPs are the minimal elements of sentences, they usually account for the highest number of errors. We did not count the proportions of NPs, VPs, clauses and sentences in each corpus, but we made the conservative assumption that they are similar. Diagram 4 shows the proportions of errors found in the comparable corpora for these three main categories. For this comparison, we only took central categories into account, since the marginal categories we identified are specific to requirements.
We notice that there is a great difference between the distribution of errors in our corpus of requirements and other corpora. Similar error rates are found for the three categories in learner productions and research papers, while requirements show significantly fewer errors linked to the VP or at the level of the clause and sentence, and significantly more on the NP.
The smaller proportion of errors in the VP in requirements could be attributed to the high level of proficiency of the writers and the fact that the requirements are proofread, which may eliminate most agreement errors. However, we make the hypothesis that the low frequency of errors in the VP is mostly due to controlled natural languages prohibiting the use of complex verb groups in requirements. This could also explain the low rate of errors at the level of the clause or sentence, as the constrained use of complex syntax helps limit errors. Conversely, requirements use more complex vocabulary and expressions, which may lead to a higher proportion of errors in the Noun Phrase.
In addition, we compared the proportion of the two most common error types in our requirements corpus with those found in our two comparison corpora. Diagram 5 shows the frequency for these error types.
In the case of missing articles (e.g. All components shall meet the requirements of [ ] table presented below), we notice that results in the two comparison corpora are similar, and are much lower than in the requirement corpus. However, determination errors in general account for 24% (student essays) and 16% (scientific papers) of all errors in the two comparison corpora, indicating that authors produce a more varied range of determination errors in these types of writing than in requirements, where determination errors other than missing articles are non-existent.
In the case of modifier stacking, there is a progression in the number of errors found in the three corpora, with these being marginal in the corpora of student essays. This is consistent with other studies on such structures (e.g. Pastor-Gomez, 2011) [11], which identify them as a feature of technical, scientific or journalistic English. We should note that, since we are looking at "absolute" rather than "relative" error numbers (i.e. only the total number of errors, not the number of errors over the number of total uses of the structure), it is not surprising to see lower error rates on this structure in types of writing that typically don't make use of them. Errors linked to modifier stacking are also more varied and complex in the corpus of requirements, with up to 5 modifiers on the left of a head noun (see examples above).
Overall, the most frequent error categories correspond to "simplification" strategies, with the omission of function words or punctuation that may be perceived by the author as superfluous or expendable. Three out of the six categories have to do with "missing" words or parts of words, while the use of noun modifiers eliminates the need for prepositions. The two types of non-standard expressions we reviewed also seem to be indicative of simplification strategies. In the example from Corpus 3 (and case yes, the tenderer to confirm if…), a more grammatically acceptable version of the requirement would include more words and a more complex syntax (e.g. and if it is the case, the tenderer must confirm that…). The same holds for the example from Corpus 2 (the segment in case one analog acquisition is detected failed can be corrected as e.g. in case one analog acquisition is detected as having failed / as being in a state of failure).
We conclude from this comparison that the distribution of errors reflects the specificities of the genre of technical documents, warranting the collection and use of data in this genre, and specific treatment as an L2 production.
6. Summary and recommendations for training
In this article, we presented the results of a research project devoted to the analysis of language errors in requirements written in English by native speakers of French. We analyzed 772 requirements from 3 corpora in 2 different technical domains, and detected 279 language errors. These errors were categorized using a tailored classification system which includes marginal (e.g. non-standard expressions) and central (e.g. agreement errors) categories.
We found a large variety of errors, with only 6 of the total 29 categories accounting for more than 5% of the errors each. However, some general tendencies were identified. The largest proportion of errors (about 46%) occurs in the Noun Phrase. This result is at odds with results from error analysis in non-native speaker productions from other genres, indicating that requirements form a specialized text genre that reflects specific writing behaviors. In addition, a significant proportion of errors is linked to missing punctuation and misspelled words, and can be remedied through the use of spellcheckers and grammar checkers.
A high number of segments were found to be unacceptable, or even ungrammatical from the point of view of standard English, but might be deemed acceptable in the context of requirements. Finally, we found numerous errors linked to the use of multiple adjectives or nouns in front of the head noun of the phrase. This type of structure can lead to interpretation errors and decrease the readability of the requirement.
The 6 most frequent error types, which account for 62% of errors in total, are not equal in terms of impact on readability, and most importantly in terms of ease of correction and prevention. For example, errors on the use of articles are notoriously frequent in the productions of non-native speakers writing in English, but they are also very difficult to address in automatic grammar checking or even in in-person teaching. Training providers should therefore focus on the errors for which the remediation is relatively simple (e.g. spelling errors), or on the errors that are really detrimental to the readability of the requirement (e.g. use of several nouns and adjectives in an NP; see our discussion of nocuous ambiguity below). The error types that we recommend trainers address are not directly linked to the native language of the requirement authors, so our recommendations can be used for authors with native languages other than French, with any adjustments that training providers deem necessary. Specifically, these recommendations could be used as part of the syllabus of the CPRE Foundation Level offered by IREB, more precisely in the "Language effects" sub-unit (see link to the syllabus in the list of references) [17].
The importance of using a spellchecker
Despite the ubiquitous presence of spellcheckers, 7% of all errors found in requirements are linked to spelling errors.
Beyond the fact that the errors themselves may decrease the clarity of the requirements, the main problem is that they directly affect the credibility and image of the company or of the requirements engineer. In the eyes of a client or contractor, the presence of easily avoidable errors in technical documentation may be the telltale sign of more significant errors in other areas.
Fortunately, the distribution of these errors shows that they are not the result of a lack of spelling skills, but rather of momentary lapses (e.g. one common word spelled correctly most of the time and misspelled a few times) and insufficient editing, both of which can be fixed relatively easily. Writing requirements is a difficult task, so requirements writers should be strongly encouraged to rely on spellcheckers for part of the editing process. This can be achieved by making sure authors are familiar with the use of spellcheckers and pay attention to the corrections proposed.
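As a minimal illustration of this kind of support, the following sketch batch-checks a requirement with an off-the-shelf spellchecker while skipping domain-specific acronyms. It assumes the open-source pyspellchecker package; the tokenization and the domain-term list are simplifying assumptions.

```python
import re
from spellchecker import SpellChecker  # assumes the open-source pyspellchecker package

def spelling_warnings(requirement: str, domain_terms: set[str]) -> dict:
    """Map likely misspellings to a suggested correction, skipping known domain terms."""
    spell = SpellChecker()
    tokens = re.findall(r"[A-Za-z]+", requirement)
    candidates = [t for t in tokens if t.upper() not in domain_terms]
    return {word: spell.correction(word) for word in spell.unknown(candidates)}

print(spelling_warnings(
    "The systme shall command the FCV to the fully open position.",
    domain_terms={"FCV", "RARV", "PACK"},  # hypothetical company-specific acronyms
))
# e.g. {'systme': 'system'}
```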
Policing the use of non-standard expressions
Roughly 15% of the errors found in requirements came from the use of non-standard expressions. However, it is perfectly acceptable for requirements authors to use non-standard expressions in their writing, since requirements writing is a technical task using specialized English. Moreover, the constraints of the form of controlled English in use might call for dedicated expressions.
Nevertheless, resorting to a non-standard expression should be a conscious choice. So that their use does not decrease the readability of requirements, alternative syntax and expressions should only be used if the following criteria are met (a minimal consistency-check sketch follows the list):
- The alternative expression fills a need that cannot be filled using standard syntax. For example, the alternative expression uses fewer words, or avoids the use of a structure that is not allowed in the controlled English in use.
- The non-standard expression is used by other requirements writers in the same field or the same company, and with the same exact meaning. Writers are aware that they are not using standard syntax, and know why they are using an alternative expression. Ideally, the non-standard expression is included in a writing guide or manual for the requirements writers of this field or company.
- The use of the non-standard expression is completely stable. It is used in the same way every time, with no hesitation between the non-standard expression and standard syntax, or between two similar non-standard expressions.
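The second and third criteria lend themselves to a simple automated audit. The following sketch counts the surface variants of a non-standard expression and flags those that are not in an approved list; the glossary and the variant pattern are hypothetical, and would in practice come from the company's writing guide.

```python
import re
from collections import Counter

# Hypothetical approved list and variant pattern for one family of expressions.
APPROVED = {"selected OFF", "selected ON"}
VARIANT_PATTERN = re.compile(r"\bselected (?:to )?(?:OFF|ON)\b")

def audit_expressions(requirements: list[str]) -> Counter:
    """Count each surface variant so unstable usage (third criterion) becomes visible."""
    counts = Counter()
    for req in requirements:
        for match in VARIANT_PATTERN.findall(req):
            status = "approved" if match in APPROVED else "unapproved variant"
            counts[(match, status)] += 1
    return counts

print(audit_expressions([
    "On ground, when PACK is selected OFF, corresponding RARV shall stay in position.",
    "When PACK is selected to OFF, the FCV shall close.",
]))
# e.g. Counter({('selected OFF', 'approved'): 1, ('selected to OFF', 'unapproved variant'): 1})
```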
Improving the readability of Noun Phrases
NPs that include several modifiers, especially in the form of other nouns, are very common in technical writing because they reduce the number of words by eliminating the need for prepositions and some determiners, and give the impression of a compact delivery of information. However, when several modifiers are stacked in front of a head noun, the readability of the NP decreases, and interpretation errors may occur as readers of requirements are left to reconstruct the intended meaning.
In particular, ambiguity arises from the fact that the modifiers used in these NPs often contain their own modifiers, making it difficult to identify the exact scope of each element and to determine whether the NP demonstrates stacked (e.g. [thermal [system breakdown]]) or embedded (e.g. [[thermal system] breakdown]) modification. The use of noun modifiers further complicates the issue, since nouns can function as heads as well as modifiers, both in the overall NP and in embedded nominal modifiers.
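A rough way to quantify this structural ambiguity is to count the possible binary groupings of the premodifiers and the head, which grows as the Catalan numbers. The sketch below computes this count; most of these groupings will not be semantically plausible, but the figure shows how quickly potential ambiguity grows with the number of premodifiers.

```python
from math import comb

# Number of binary bracketings of an NP with n elements (premodifiers + head):
# the Catalan number C(n-1). A rough indicator of structural ambiguity only.
def bracketings(n_elements: int) -> int:
    n = n_elements - 1
    return comb(2 * n, n) // (n + 1)

for phrase, n in [("thermal system breakdown", 3),
                  ("probable average operational duty cycle", 5)]:
    print(f"{phrase}: {bracketings(n)} possible groupings")
# thermal system breakdown: 2 possible groupings
# probable average operational duty cycle: 14 possible groupings
```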
When discussing ambiguous structures, we must address the question of nocuous ambiguity. Ambiguity is said to be innocuous when a theoretically ambiguous text is interpreted in the same way by different readers regardless of its ambiguity; it is said to be nocuous when the ambiguity in the structure actually yields different interpretations (Willis et al., 2008) [16]. According to this study, nearly half of the cases of syntactic ambiguity were attributable to the use of nominal modifiers.
Research on the nocuous status of modifier stacking including nominal modifiers would be very useful in helping to identify the structures that should be corrected. However, as is visible from the examples from our corpus given in section 5, most NPs with modifier stacking include embedded and stacked modification, and usually more than 2 modifiers. We posit that these two factors are enough to create nocuous ambiguity. In addition, the lack of prepositions and determiners clarifying the relationships between the elements of the NP may lengthen reading times and mobilize cognitive resources to the detriment of other elements of the requirement.
As a consequence, in order to minimize the risk of nocuous ambiguity in NPs and to maintain fluidity in reading, we recommend that the number of modifiers placed before the head noun of an NP be limited to two adjectives or two nouns, or one of each, therefore ensuring that no more than 3 elements will be found in succession in an NP without a preposition or conjunction. There is an overlap between our recommendations and other constraints on requirements writing, since the use of heavy NPs is often discouraged in controlled English.
Training providers should make sure that requirements authors are familiar with the structure and pay attention to the readability issues that may arise when it is used. In addition, trainers should help authors choose adequate ways to rewrite heavy NPs.
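The recommended limit also lends itself to automatic checking. The following sketch flags noun chunks with more than two adjective or noun premodifiers; it assumes the spaCy library with its small English model installed (python -m spacy download en_core_web_sm), and its accuracy depends on the underlying part-of-speech tagger.

```python
import spacy

# Assumes spaCy with the small English model installed.
nlp = spacy.load("en_core_web_sm")
MAX_PREMODIFIERS = 2

def heavy_noun_phrases(requirement: str) -> list[str]:
    """Return noun chunks whose head carries more than two adjective/noun premodifiers."""
    flagged = []
    for chunk in nlp(requirement).noun_chunks:
        premodifiers = [t for t in chunk
                        if t != chunk.root and t.pos_ in ("NOUN", "PROPN", "ADJ")]
        if len(premodifiers) > MAX_PREMODIFIERS:
            flagged.append(chunk.text)
    return flagged

print(heavy_noun_phrases(
    "The maximum engine casing temperature shall be at 670 degrees."))
# e.g. ['The maximum engine casing temperature']
```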
Further work: Implementation in the LELIE Research Platform
This article has mainly dealt with the analysis of errors made by non-native authors writing requirements in English. The main errors that were identified can be expressed as patterns and implemented in the LELIE authoring platform (Garnier, 2014) [4], (Saint-Dizier, 2015) [13], (Saint-Dizier and Kang, 2015) [14]. In this platform, errors can be signaled in the text either by means of dedicated tags, or via specific comments when texts are in Word or Excel. When a correction is automatically induced by the system, it is suggested in the comment.
Such an implementation would allow us to test our diagnoses and to provide authors with automatic corrections, which are often welcome but require some control from the authors. Such corrections can also be learned automatically when they turn out to be recurrent.
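As an illustration of the "dedicated tags" signalling mode mentioned above, the following sketch wraps a flagged span in an annotation tag carrying the category and a suggested correction. The tag and attribute names are hypothetical and do not reproduce LELIE's actual markup.

```python
def tag_error(text: str, span: str, category: str, suggestion: str = "") -> str:
    """Wrap the first occurrence of `span` in an annotation tag (hypothetical format)."""
    attrs = f'cat="{category}"'
    if suggestion:
        attrs += f' suggest="{suggestion}"'
    return text.replace(span, f"<error {attrs}>{span}</error>", 1)

print(tag_error("The system shall includes a locking device.",
                "shall includes", "VP/Modality", "shall include"))
# The system <error cat="VP/Modality" suggest="shall include">shall includes</error> a locking device.
```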
The LELIE technical text authoring platform is a university prototype (Saint-Dizier, 2014)[12]. It has been plugged into Word and Excel to allow authors to call LELIE from their document and to make corrections directly on their document. The LELIE platform is freely available from the authors under a creative commons license.
References and Literature
- [1] Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan and Finegan, Edward. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education.
- [2] Corder, Stephen Pit. 1981. Error Analysis and Interlanguage. Oxford: Oxford University Press.
- [3] Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Council of Europe.
- [4] Garnier, Marie. 2014. "Utilisation de méthodes linguistiques pour la détection et la correction automatisées d'erreurs produites par des francophones écrivant en anglais." PhD Thesis in English Linguistics. Université de Toulouse. https://tel.archives-ouvertes.fr/tel-01257640
- [5] Granger, Sylviane, Dagneaux, Estelle, Meunier, Fanny and Paquot, Magali. 2009. The International Corpus of Learner English. Version 2. Handbook and CD-Rom. Louvain-la-Neuve: Presses Universitaires de Louvain.
- [6] James, Carl. 1998. Errors in Language Learning and Use: Exploring Error Analysis. London: Longman.
- [7] Jarvis, Scott and Pavlenko, Aneta. 2007. Crosslinguistic Influence in Language and Cognition. London, New York: Routledge.
- [8] Kuhn, Tobias. 2014. "A survey and classification of controlled natural languages." Computational Linguistics, 40(1).
- [9] Lennon, Paul. 1991. "Error: some problems of definition, identification and distinction." Applied Linguistics, vol. 12, no. 2, 180-196.
- [10] Lyons, John. 1968. Introduction to Theoretical Linguistics. Cambridge: Cambridge University Press.
- [11] Pastor-Gomez, Iria. 2011. The Status and Development of N+N Sequences in Contemporary English Noun Phrases. Bern: Peter Lang.
- [12] Saint-Dizier, Patrick. 2014. Challenges of Discourse Processing: The Case of Technical Documents. Cambridge Scholars Publishing.
- [13] Saint-Dizier, Patrick. 2015. "Features of an error correction memory to enhance technical texts authoring in LELIE." IJKCDT, vol. 5(2).
- [14] Saint-Dizier, Patrick and Kang, Juyeon. 2015. "LELIE - An Intelligent Assistant for Improving Requirement Authoring." RE Magazine, vol 2(2).
- [15] Thouësny, Sylvie. 2011. "Modeling second language learners' interlanguage and its variability: A computer-based dynamic assessment approach to distinguishing between errors and mistakes." PhD dissertation, Dublin City University.
- [16] Willis, Alistair, Chantree, Francis and De Roeck, Anne. 2008. "Automatic identification of nocuous ambiguity." Research on Language and Computation, 6(3-4), 355-374.
- [17] IREB, 2015. Certified Professional for Requirements Engineering – Foundation Level, Syllabus. Version 2.2. https://www.ireb.org/en/downloads/tag:syllabi#top
Marie Garnier is an associate professor in the English Department at the Université Toulouse 2 – Jean Jaurès (Toulouse, France). She holds a PhD in English linguistics on the topic of automatic grammar checking. Her research interests include the interface between syntax and lexical semantics in English, the definition and processing of errors produced by English learners, and linguistics-driven NLP. She can be reached at mgarnier@univ-tlse2.fr.
Patrick Saint-Dizier, PhD, is a senior researcher in Computational Linguistics and Artificial Intelligence at CNRS, IRIT, Toulouse, France. He specializes in discourse and semantic analysis. He has developed several national and European projects dedicated to logic programming, argumentation and technical text analysis. He is the author of several conference and journal articles and of 11 books. Besides foundational research, he has long experience of research and development activities. Contact: stdizier@irit.fr