Natural Language Processing, or NLP, is a field combining computer science, linguistics and artificial intelligence. Natural Language refers to human language: the code of communication we use as people to transmit thoughts and ideas to one another; English is one obvious example. Essentially, NLP scientists are concerned with building technology that 'understands' linguistic data, with the goal of extracting information from the content of documents.
One proposed application of NLP was automatic summarisation. If a technology could be created that distinguished the critically important elements of a text from the superfluous details and formalities, virtually instantly, it would be a ground-breaking development in the academic world, facilitating research and education immeasurably. The time and cognitive effort spent determining the relevance of a document's content to a specific information need is enormous; the data rate of the human brain is simply too slow.
The first effort to produce automatic summaries was made by Hans Peter Luhn in 1958. Using fairly rudimentary logic, Luhn pioneered a heuristic method of summarisation: an algorithm that scored the importance of each sentence in a text according to the statistical frequency of its words. This, however, was quite a blunt approach and not considered very accurate.
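Luhn's idea can be illustrated with a short sketch (a simplified reconstruction for illustration, not his original implementation): count how often each significant word occurs, score every sentence by the summed frequency of its words, and keep the top-scoring sentences.

```python
from collections import Counter
import re

# A tiny illustrative stopword list; Luhn filtered out common words like these.
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in", "it", "was"}

def luhn_summary(text, n_sentences=2):
    """Rank sentences by the summed frequency of their significant words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the chosen sentences in their original document order.
    return [s for s in sentences if s in top]
```

The weakness the text goes on to describe is visible here: nothing about meaning is modelled, only raw word counts.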
The next significant development came from H.P. Edmundson in 1969, who recognised the shallowness of this approach. Edmundson argued that word frequency alone was not enough to determine an element's importance. He observed that cue words in a sentence can raise or lower its overall significance; that sentences containing words from the title carry extra weight; and that a sentence's location within the paragraph or document matters.
Edmundson saw that it was not one feature but numerous features that should determine the content of a summary, and those features might not be weighted equally. However, both Luhn and Edmundson were restricted by the hardware of the time. Luhn had to punch the texts onto cards before they could be processed, and Edmundson could not digest documents over 4000 words because of limited computer storage. It was also an expensive enterprise. Given that the goal was to save readers time and effort, quite a lot was going into producing a far-from-perfect end product.
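Edmundson's insight, combining several unequally weighted features rather than frequency alone, can be sketched as follows. The specific weights, cue words and location rule here are invented for illustration; Edmundson tuned his against hand-made extracts.

```python
def edmundson_score(sentence_idx, sentence, title_words, cue_bonus_words,
                    n_sentences, w_cue=1.0, w_title=1.0, w_location=1.0):
    """Combine cue-word, title-word and location evidence into one score.

    The weights are illustrative placeholders: the point is that each
    feature contributes separately and need not count equally.
    """
    words = set(sentence.lower().split())
    cue = len(words & cue_bonus_words)      # significance-raising cue words
    title = len(words & title_words)        # overlap with the document title
    # Crude location feature: first and last sentences get a bonus.
    location = 1.0 if sentence_idx in (0, n_sentences - 1) else 0.0
    return w_cue * cue + w_title * title + w_location * location
```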
Progress in machine summarisation stagnated for a while; no one saw a way of innovating in the data-driven route taken by Luhn and Edmundson. It was time for a new perspective on the task.
In the 1970s, a paradigm shift took place. Instead of focusing on the content itself, in the way that Luhn and Edmundson did, scientists began developing methods in which the system drew on background knowledge about the problem at hand. They created systems intended to understand the texts, producing summaries on the basis of the information they had understood.
One man behind this approach was Gerald DeJong. In 1982, he built a system known as FRUMP (Fast Reading Understanding and Memory Program), which used things called sketchy scripts to pick out the key information from a text. Sketchy scripts were templates tailored to particular domains, which FRUMP filled in according to a document's content. The filled-in scripts could then be used to generate summaries.
The problem, however, was that for every domain (e.g. protests, restaurants, football matches), a sketchy script had to be written up by hand to account for that specific domain. In other words, the system required a backdrop of information about the document's content in order to determine the key elements within it, and that backdrop had to be created manually.
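The sketchy-script idea, a hand-written domain template whose slots are filled from the text, can be sketched like this. The domain, slot names and trigger patterns below are invented for illustration; FRUMP's real scripts were far richer.

```python
import re

# A hypothetical 'sketchy script' for an earthquake domain: named slots
# plus hand-written patterns that try to fill them from the document.
EARTHQUAKE_SCRIPT = {
    "location": re.compile(r"earthquake (?:struck|hit) ([A-Z][a-z]+)"),
    "magnitude": re.compile(r"magnitude (\d+(?:\.\d+)?)"),
}

def fill_script(script, text):
    """Fill each slot of a domain script from the document, where possible."""
    slots = {}
    for slot, pattern in script.items():
        match = pattern.search(text)
        if match:
            slots[slot] = match.group(1)
    return slots

def summarise(slots):
    """Generate a canned summary sentence from the filled slots."""
    if {"location", "magnitude"} <= slots.keys():
        return f"A magnitude {slots['magnitude']} earthquake struck {slots['location']}."
    return "No earthquake details found."
```

The manual-labour problem is plain in the sketch: a new domain means writing a whole new script by hand.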
In 1985, a group of computer scientists developed a similar but slightly different approach called SUSY. This, too, relied on hand-coded information and needed domain-specific knowledge to generate summaries. The summaries produced by both FRUMP and SUSY were also abstractive, i.e. the output contained information that wasn't present in the input (at least not in the same form).
The 1990s saw a return to the empirical approach first explored in the 50s and 60s, but this time it was more advanced. Instead of heuristics derived automatically from the data, informational importance was determined by procedures explicitly coded by the researchers themselves.
By using graph-based ranking models and calculating the links within a text, scientists were better able to chart the significance of more nuanced linguistic features such as anaphora, lexical repetition and coreferential links. These features were combined to calculate a score for each sentence, and summaries were generated from the highest-scoring sentences.
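A heavily simplified version of graph-based sentence ranking looks like this, with plain word overlap standing in for the richer anaphoric and coreferential links described above, and a single summation standing in for iterative PageRank-style scoring:

```python
def rank_sentences(sentences):
    """Score each sentence by how strongly it links to the rest of the text.

    Edge weight between two sentences = size of their word overlap; a
    sentence's score is the sum of its edge weights. (A one-step
    simplification of iterative graph-ranking algorithms.)
    """
    word_sets = [set(s.lower().split()) for s in sentences]
    scores = []
    for i, ws in enumerate(word_sets):
        score = sum(len(ws & other) for j, other in enumerate(word_sets) if j != i)
        scores.append(score)
    return scores
```

Sentences that share vocabulary with many others score highly, so the summary gravitates towards the text's central threads.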
While this was a marked improvement, the linguistic features and their relative significance were not understood concretely enough to produce truly accurate summaries.
The turn of the millennium saw computer scientists seeking more advanced artificial intelligence. Machine learning algorithms were sought out to generate better summaries than before. While this gave rise to further improvements through tools like sentence compression and recall-oriented evaluation, there was still work to be done.
2015 saw neural approaches based on deep learning come into play for automatic summarisation. Neural networks were developed: non-linear statistical data modelling tools which could infer complicated relationships in data to find patterns within them. In fact, the whole concept of a neural network was first proposed by Alan Turing himself in 1948.
AIs became able to model the structure of discourse in a document and use it to determine the most important components. This is similar to how Genei's summarisation tool works.
Genei’s summariser converts words into 'computer-readable' objects called vectors. (Strictly, it first breaks words down into tokens and then converts these into vectors; there are roughly 1.5 tokens per word.) Once it has created these computer-readable representations, it is able to generate numerical representations for words, sentences, paragraphs and more. The algorithm is then able to rank sentences based on their importance and relevance within the document.
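The tokens-to-vectors step can be illustrated with a toy example. The tiny embedding table and the subword split below are invented for illustration; real systems learn embeddings with thousands of dimensions per token.

```python
# A toy embedding table: each token maps to a small vector. Note the
# subword split: 'summarise' becomes two tokens, which is why there are
# more tokens than words (~1.5 tokens per word on average).
EMBEDDINGS = {
    "summar": [0.9, 0.1],
    "ise":    [0.2, 0.3],
    "this":   [0.1, 0.8],
    "text":   [0.4, 0.6],
}

def sentence_vector(tokens):
    """Average the token vectors to get one numerical representation
    for the whole sentence."""
    dims = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dims
    for t in tokens:
        for d, v in enumerate(EMBEDDINGS[t]):
            total[d] += v
    return [v / len(tokens) for v in total]
```

With every sentence reduced to a vector like this, comparing and ranking sentences becomes ordinary arithmetic.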
The algorithm was developed by training our model on a large data set containing articles and human-written summaries. The model started by generating summaries more or less at random and, based on how similar each one was to the human-written summary, kept improving until it scored highly. Once it scored highly across all of the example articles and summaries, it was considered trained, and it can now be used on unseen articles!
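The "how similar is it to the human-written summary" signal can be sketched with a simple recall-oriented word-overlap score, a bare-bones stand-in for metrics like ROUGE-1. Genei's actual training objective is not described here, so this is purely illustrative.

```python
def overlap_recall(candidate, reference):
    """Fraction of the reference summary's words found in the candidate.

    A crude, recall-oriented similarity in the spirit of ROUGE-1;
    real training pipelines use more sophisticated objectives.
    """
    cand_words = set(candidate.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)
```

During training, a score near 1.0 means the generated summary covers almost everything the human writer kept, which is the signal the model is pushed to maximise.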