Search engine for plagiarism

Preface


[Image: Pushkin]
At one time I was lucky enough to hold all sorts of odd jobs. For example, I almost became an administrator at a synagogue. The only thing that stopped me was the feeling that, as the only Gentile there, I would be the one forced to work on Saturdays.

Another option was curious too. A firm wrote essays and coursework for students too lazy to write them themselves. Later I learned that this is a fairly common and profitable business, which even has a name of its own, "paper mill", but at the time such a way of earning a living seemed completely surreal. Still, it has to be said that the job offered plenty of interesting challenges, among them the most complicated and cunning one I have ever tackled in my career, one I can proudly tell my children about someday.

The task statement was very simple. The writers on the exchange were remote workers, very often Arabs and Africans for whom English was not a native language, and they were no less lazy than the students themselves. They often took the path of least resistance and, instead of writing original work, simply lifted it from the Internet, in whole or in part. Accordingly, it was necessary to find the source (or sources), compare them, determine the percentage of plagiarism, and pass the collected information on so the negligent writer could be confronted.

The task was somewhat eased by the language of the coursework: it was exclusively English, without cases or complicated inflectional forms. It was greatly complicated by the fact that it was not at all clear from which side to approach it.

Perl was chosen as the implementation language, which turned out to be a very good decision. With a rigid, statically compiled language and its slow edit-compile-run cycle, this problem could not have been solved at all: there was no way to write a finished solution up front, and no way to reach one except through numerous experiments. Plus, Perl has a lot of excellent, well-worn libraries.


Dead ends


[Image: Exit to the promenade]
Initially, the task was handed to some careless student to poke around in. He did not philosophize for long: since you need to search the Internet, you need a search engine. Shove the entire text into Google, and it will find where it was taken from. Then read the sources it finds and compare them with pieces of the original text.

Of course, nothing came of it.

First, if you send Google the full text, it searches for it very poorly. After all, Google stores indexes in which the number of adjacent words is necessarily limited.

Second, it very quickly became clear that Google does not like a lot of searches coming from the same address. Until then I had thought the phrase "Did Google ban you?" was just a joke. It turned out to be nothing of the kind: after a certain number of queries Google really does ban you, putting up a fairly complex captcha.

And the idea of parsing the HTML is not a very good one anyway, because the program may break at any moment, as soon as Google decides to slightly improve the layout of its results page.

The student decided to cover his tracks and reach the search engine through open proxies: find a list of them on the Internet and cycle through it. In practice, half of these proxies did not work, and the other half were hopelessly slow, so nothing good came of it.

Third, searching for chunks of text by character-by-character comparison was unrealistically slow and totally impractical. And useless, too, because the Kenyan writers were savvy enough not to copy the text word for word but to slightly change the wording here and there.

I had to start by reading the specialized literature. Since the task was a marginal one, it was not described in any textbook or any reputable book. All I found was a bunch of scientific articles on specific issues and one survey thesis by some Czech. Alas, it reached me too late: by then I already knew all the methods described in it.
Off topic, I cannot help noting that almost all scientific articles published in respectable journals are a) hard to get and b) pretty useless. The sites where they are stored, the ones that come up in the first links of any search, are always paywalled and bite hard: usually almost ten dollars per publication. However, if you try a bit harder, you can usually find the same article in the public domain. If that fails, you can try writing to the author, who as a rule will not refuse the courtesy of sending a copy (from which I conclude that the authors get little from the current system, and the proceeds go to someone else).

However, there is usually little practical use in any specific article. With rare exceptions they contain no information from which you could sit down and write out an algorithm. They offer either abstract ideas with no guidance on how to implement them; or a pile of mathematical formulas, wading through which you realize that the same thing could have been written in two lines of plain human language; or the results of experiments conducted by the authors, always with the same comment: "not everything is clear yet, work must continue." I do not know whether these articles are written just for the sake of it, as part of some intra-scientific ritual, or whether greed keeps the authors from sharing real know-how that could quite successfully be used for their own startup. In any case, the degeneration of science is obvious.

By the way, the biggest and best-known plagiarism-detection website is called Turnitin. It is a practical monopolist in this area. Its inner workings are classified no worse than a military base: I have not found a single article, not even a short note, describing at least in general terms what algorithms are used there. A solid wall of mystery.

However, enough of the lyrical digression; back to the dead ends, this time my own.

The idea of document fingerprinting did not pan out either. In theory it looked good: each document downloaded from the Internet gets a fingerprint, some long number that somehow reflects its content. You keep a database that stores URLs and fingerprints instead of the documents themselves, and then it is enough to compare the original text against the fingerprint database to find suspects. In practice it does not work: the shorter the fingerprints, the worse the comparison, and once they grow to half the length of the source, storing them becomes pointless. Add to that the changes the authors make to trick detection. And given the huge volume of material online, even storing short fingerprints very quickly becomes burdensome because of the sheer size of the data.


Parsing and normalization


[Image: Woodcutter]
At first this stage seems banal and uninteresting: well, obviously the input will be text in MS Word format rather than a plain text file; it has to be parsed and split into sentences and words. In fact, it is a great source of quality improvement, one that outdoes any clever algorithm. It is like OCR of books: if the original is scanned crooked and smeared with ink, no subsequent tricks will fix it.

Incidentally, parsing and normalization are required not only for the source text but also for everything found on the Internet, so besides quality this stage also needs speed.

So, we get a document in one of the common formats. Most of them are parsed easily: HTML, for example, reads perfectly well with HTML::Parser, and PDF and PS can be handled by calling an external program like pstotext. Parsing OpenOffice files is pure pleasure; you can even hook up XSLT if you enjoy perversions. The one thing spoiling the picture is the frickin' Word format: a more mongrel text format is impossible to find, hellishly difficult to parse and devoid of any internal structure. For details see my previous article. If it were up to me, I would never deal with it at all, but it is more common than all other formats combined. Is it Gresham's law in action, or the machinations of world evil? If God is benevolent, why does everyone write in Word?
While parsing, if the format is a sane one, you can find all sorts of useful things: for example, locate the table of contents of a document and exclude it from the comparison (there is nothing useful in it anyway). The same can be done with tables (short snippets in table cells produce a lot of false positives). You can detect headings, throw out pictures, and tag Internet addresses. For web pages it makes sense to exclude the sidebar, header and footer, if they are marked up in the file (HTML5 allows this).

Oh, and by the way, the input may be an archive, which has to be unpacked and processed file by file. It is important not to confuse a real archive with some zipped complex format like OOXML.

Once we have plain text, we can keep working on it. It pays to discard the cover page and the boilerplate that universities require ("submitted by student so-and-so", "checked by Professor So-and-So"). At the same time we can deal with the list of references. Finding it is not so easy, because it goes by at least a dozen names ("References", "List of references", "Works Cited", "Bibliography" and so on), and it may not be labeled at all. It is best simply to throw it out of the text, because it hampers recognition and creates a considerable load.

The resulting text should be normalized, that is, brought into a unified form. The first step is to find all Cyrillic and Greek letters that look like English ones. Ingenious authors insert them into the text to trick the plagiarism check. No such luck: a trick like that is absolute proof of guilt and a reason to throw the author out on his ear.

Then all the colloquial contracted forms like "can't" are replaced with their full equivalents.

Now we have to replace all the fancy typographic symbols with simple ones: guillemets, curly quotes, single quotation marks, em and en dashes, apostrophes, ellipses, and ligatures like ff, ffi, st. Replace two apostrophes in a row with a normal quotation mark (for some reason this comes up very often), and a double dash with a single one. All sequences of whitespace characters (there are plenty of those, too) are replaced with a normal space. Then throw out of the text everything that does not fit into the ASCII range. And finally, remove all control characters except the newline.
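To make this concrete, here is a minimal sketch of such a normalization pass in Perl. The replacement set is illustrative rather than the full production list, and the handling of Cyrillic and Greek look-alikes mentioned above is assumed to happen in a separate, earlier step.

```perl
use strict;
use warnings;
use utf8;

sub normalize_text {
    my ($text) = @_;

    # Typographic quotes, dashes, ellipses and common ligatures -> ASCII
    $text =~ s/[«»“”„]/"/g;
    $text =~ s/[‘’‚]/'/g;
    $text =~ s/[–—]/-/g;
    $text =~ s/…/.../g;
    $text =~ s/ﬀ/ff/g;
    $text =~ s/ﬁ/fi/g;
    $text =~ s/ﬃ/ffi/g;
    $text =~ s/ﬆ/st/g;

    # Two apostrophes -> a normal quotation mark, double dash -> single dash
    $text =~ s/''/"/g;
    $text =~ s/--/-/g;

    # Collapse any run of whitespace (except newlines) into a single space
    $text =~ s/[^\S\n]+/ /g;

    # Drop everything outside ASCII, then control characters except newline
    $text =~ s/[^\x00-\x7F]//g;
    $text =~ s/[\x00-\x09\x0B-\x1F\x7F]//g;

    return $text;
}
```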

Now the text is ready to be compared.

Next we split it into sentences. This is not as simple as it seems at first glance; in natural language processing, everything seems easy only at first. A sentence can end with a period, an ellipsis, an exclamation or question mark, or nothing at all (end of paragraph).

Plus, a period can also follow all kinds of abbreviations that do not end a sentence. The full list runs half a page: Dr., Mr., Mrs., Ms., Inc., vol., et al., pp., and so on and so forth. Plus Internet links: fine if the URL starts with the protocol, but it is not always there. An article may, for example, discuss online shops and constantly mention Amazon.com. So on top of everything you also need to know all the domains: a dozen major ones and a couple of hundred country-code ones.

And you have to accept a loss of accuracy, because the whole process now becomes probabilistic: each period may or may not be the end of a sentence.

The original version of the sentence splitter was written head-on: regular expressions found all the "wrong" periods and replaced them with a different character, the text was split into sentences on the remaining ones, and then the periods were put back.
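A minimal sketch of that head-on splitter, with a deliberately tiny abbreviation and domain list (the real lists, as noted above, run to half a page and hundreds of entries):

```perl
use strict;
use warnings;

my @ABBREVS = ('Dr', 'Mr', 'Mrs', 'Ms', 'Inc', 'vol', 'pp');

sub split_sentences {
    my ($text) = @_;

    # Hide periods that do not end a sentence behind a placeholder character
    for my $abbr (@ABBREVS) {
        $text =~ s/\b\Q$abbr\E\./$abbr\x{1}/g;
    }
    # Hide the dot in bare domain names like Amazon.com
    $text =~ s/\.(com|org|net|edu|gov)\b/\x{1}$1/g;

    # Split on sentence-ending punctuation or on paragraph breaks
    my @sentences = split /(?<=[.!?])\s+|\n{2,}/, $text;

    # Restore the hidden periods
    s/\x{1}/./g for @sentences;

    return grep { /\S/ } @sentences;
}
```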

Then I felt ashamed that I was not using the advanced techniques developed by modern science, so I started exploring other options. I found a fragment in Java and spent a couple of geological eras taking it apart (oh, what a dull, monotonous, verbose language). I found Python's NLTK. But most of all I liked a paper by one Dan Gillick (Gillick, Dan, "Improved Sentence Boundary Detection"), in which he boasted that his method utterly outperformed all others. The method was based on Bayesian probabilities and required prior training. On the texts I trained it on, it worked excellently; on the rest... well, not terribly, but not much better than the most simple-minded version with the list of abbreviations. In the end I went back to that one.

Searching the web


So, now we have the text, and we have to make Google work for us, looking for its pieces scattered around the Internet. Of course the regular search cannot be used, so what then? The Google API, of course. That is quite another matter: the terms are much more liberal, the programmatic interface is convenient and stable, and there is no HTML to scrape. The number of queries per day is limited, but Google did not seem to enforce the limit strictly, as long as you were not impudent enough to send millions of queries.

The next question is what pieces of text to send. Google stores some information about the distance between words. Empirically it turned out that the best results come from runs of 8 words. The final algorithm was:

  • Split the text into words.
  • Throw out the so-called stop words (function words that occur most often: a, on and so on; I used the list that ships with MySQL).
  • Generate queries of eight words with an overlap (that is, the first query is words 1-8, the second is words 2-9, and so on; you can even reduce the overlap to two words, which saves queries but slightly degrades quality). A sketch of this step follows the list.
  • If the text is large (>40 KB), you can drop every third query, and if it is very large (>200 KB), even every second one. This hurts the search, but not much, evidently because plagiarists usually lift whole paragraphs rather than individual sentences.
  • Then send all the queries to Google, possibly even in parallel.
  • Finally, collect the answers, parse them, merge them into one list and throw out the duplicates. You can also sort the list of returned addresses by how many times each occurred and cut off the tail, reasoning that rare addresses are not indicative and hardly affect the result. Unfortunately, here we run into the so-called distribution of texts, which crops up from every angle when you search for plagiarism. It is a sort of upside-down exponential with a very long, sad tail stretching off to infinity. The whole tail cannot be processed, and where to cut it is unclear: wherever you chop it off, quality suffers. The same goes for the list of addresses, so I cut it according to empirical formulas that depend on the length of the text. At least this way the processing time stays a predictable function of the number of letters.
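Here is a rough sketch of the query-generation step from the list above. The stop-word list is truncated for the example, and the window and step sizes are parameters rather than the exact production values.

```perl
use strict;
use warnings;

# A tiny illustrative stop-word list; the real one was taken from MySQL
my %STOP = map { $_ => 1 } qw(a an the on in of and or to is are was were);

sub make_queries {
    my ($text, $window, $step) = @_;
    $window ||= 8;   # words per query
    $step   ||= 1;   # shift between consecutive queries (overlap = window - step)

    my @words = grep { !$STOP{ lc $_ } } split /\s+/, $text;

    my @queries;
    for (my $i = 0; $i + $window <= @words; $i += $step) {
        push @queries, join ' ', @words[ $i .. $i + $window - 1 ];
    }
    return @queries;
}
```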


The algorithm worked fine until Google woke up and shut the free ride down. The API remained, and even improved, but the company started charging for it, a rather steep $4 per 1,000 requests. I had to look at the alternatives, of which there were exactly two: Bing and Yahoo. Bing was free, but that is where its merits ended. It searched much worse than Google. Google may be the new Evil Corporation, but its search engine is still the best in the world. Moreover, Bing searched worse even than itself: through the API it found one and a half times fewer links than through the user interface. It had a disgusting habit of finishing requests with an error, so they had to be repeated; evidently that is how Microsoft throttles the flow of requests. In addition, the number of words per query had to be reduced to five, the stop words left in, and the overlap cut down to a single word.

Yahoo was somewhere in the middle between Google and Bing, both in price and in search quality.

Along the way another idea came up. The head of the department found a project that collected the contents of the entire Internet every day and put it somewhere on Amazon. All we had to do was grab the data, index it into a full-text database, and then search it for what we needed: essentially, write our own Google, only without the spider. That was, as you might imagine, completely unrealistic.


Searching the local database


[Image: Local resource]
One of Turnitin's strengths is its popularity. A lot of work is submitted to it: students send theirs, teachers send their students', and its search base grows all the time. As a result, it can detect work stolen not only from the Internet but also from last year's coursework.
We went down the same path and built a local database of completed orders, plus the materials that users attached to their orders ("Here is an article on the subject the essay should cover"). Writers, as it turned out, love to recycle their own previous work.

All this stuff went into a KinoSearch full-text index (the library has since been renamed Lucy). The indexer ran on a separate machine. The database proved to be good: although the document count ran into the hundreds of thousands, it searched quickly and thoroughly. The only drawback was that adding fields to the index or changing the library version meant reindexing from scratch, which took a couple of weeks.
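For a first approximation, the library's convenience wrapper is enough. A sketch along these lines using Lucy::Simple (the successor of the KinoSearch interface); the field names and the number of wanted hits are my own choices, and the exact interface should be checked against the module's documentation.

```perl
use strict;
use warnings;
use Lucy::Simple;

# Field names (url, body) are chosen for this example only
my $index = Lucy::Simple->new(
    path     => '/var/lib/plagiarism/index',
    language => 'en',
);

# Indexing: one document per completed order or attached material
sub index_doc {
    my ($url, $text) = @_;
    $index->add_doc({ url => $url, body => $text });
}

# Searching: return the URLs of the best-matching stored documents
sub find_candidates {
    my ($query) = @_;
    my @urls;
    $index->search( query => $query, num_wanted => 20 );
    while ( my $hit = $index->next_hit ) {
        push @urls, $hit->{url};
    }
    return @urls;
}
```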


Comparison


[Image: Fortune telling]
Now for the juiciest part, without which all the rest is pointless. We need two checks. The first compares two texts as a whole and determines whether one contains pieces of the other. If there are no such pieces, we can stop there and save computing power. If there are, a more complex and heavier algorithm kicks in, one that matches individual sentences.

Initially, document comparison used the shingle algorithm: overlapping pieces of the normalized text. For each piece a checksum of sorts is computed, and the checksums are then compared. The algorithm was implemented and even worked in the first version, but it performed worse than vector-space methods. The idea of shingles did turn out to be unexpectedly useful for the web search, though, but I have already written about that.
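For reference, a minimal sketch of the shingle idea; the window size and the use of MD5 as the checksum are illustrative assumptions, not necessarily what the first version used.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Build a set of checksums over overlapping word windows ("shingles")
sub shingles {
    my ($text, $size) = @_;
    $size ||= 5;
    my @words = split /\s+/, lc $text;
    my %set;
    for my $i (0 .. $#words - $size + 1) {
        $set{ md5_hex(join ' ', @words[ $i .. $i + $size - 1 ]) } = 1;
    }
    return \%set;
}

# Fraction of shingles of the first text that also occur in the second
sub shingle_overlap {
    my ($a, $b) = (shingles($_[0]), shingles($_[1]));
    return 0 unless %$a;
    my $common = grep { $b->{$_} } keys %$a;
    return $common / keys %$a;
}
```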

So, we compute a similarity coefficient between documents. The algorithm is the same as in search engines. I will explain it in a simple, homespun way; the scientific description can be found in the same scientific book (Manning C., Raghavan P., Schütze H. Introduction to Information Retrieval. Williams, 2011). I hope I do not muddle anything, which is quite possible: this is the most difficult part of the system, and it kept changing constantly.

So, take all the words from both texts, extract their stems, throw out duplicates and build a giant matrix. The columns are the stems, and there are only two rows: the first text and the second text. At each intersection put the number of times the given word occurred in that text.

The model is rather simple; it is called "bag of words" because it ignores the order of words in the text. But for us that is exactly what we need, because plagiarists, when reworking a text, very often shuffle words around while rephrasing what they copied.

Extracting word stems is called stemming in linguistic jargon. I did it with Snowball: fast and trouble-free. Stemming is needed to improve plagiarism detection, because savvy authors do not just copy someone else's text, they cosmetically rework it, often turning one part of speech into another.

So we have a matrix of stems that describes a huge multidimensional vector space. Now we treat our texts as two vectors in this space and compute the cosine of the angle between them (via the dot product). This will be the measure of similarity between the texts.

Simple, elegant, and in most cases correct. It only works badly when one text is much larger than the other.

Experimentally it was found that texts with a similarity coefficient below 0.4 need not be considered. But then, after support complained about a couple of missed plagiarized sentences, the threshold had to be lowered to 0.2, which made it pretty useless (that cursed distribution of texts again).

Now a few words about implementation. Since every comparison involves the same original text, it makes sense to cache its stems and their occurrence counts once; that way a quarter of the matrix is ready in advance.
To multiply the vectors I first used PDL (what else?), but then, in pursuit of speed, I noticed that the vectors I was working with were terribly sparse, and wrote my own implementation on top of Perl hashes.
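A minimal sketch of that kind of sparse cosine similarity on plain hashes (stemming and stop-word removal are assumed to have been done already):

```perl
use strict;
use warnings;

# Term-frequency vector as a hash: stem -> number of occurrences
sub tf_vector {
    my %tf;
    $tf{$_}++ for split /\s+/, lc $_[0];
    return \%tf;
}

# Cosine of the angle between two sparse vectors
sub cosine {
    my ($a, $b) = @_;
    # Iterate over the smaller hash; only shared keys contribute to the dot product
    ($a, $b) = ($b, $a) if keys %$a > keys %$b;
    my $dot = 0;
    $dot += $a->{$_} * ($b->{$_} // 0) for keys %$a;

    my $norm_a = 0; $norm_a += $_ ** 2 for values %$a;
    my $norm_b = 0; $norm_b += $_ ** 2 for values %$b;
    return 0 unless $norm_a && $norm_b;
    return $dot / (sqrt($norm_a) * sqrt($norm_b));
}

# Usage: my $similarity = cosine(tf_vector($text1), tf_vector($text2));
```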

Now we need to compute the similarity coefficient between sentences. There are two options, and both are variations on the same vector-space theme.

You can do it quite simply: take the words from both sentences, build a vector space out of them, and compute the angle. There is no need even to account for the number of occurrences of each word, since a word rarely repeats within a single sentence anyway.

Or you can do it differently and apply the classical tf-idf algorithm from the book, except that the collection of documents becomes the collection of sentences of the two texts, and the individual documents become the sentences. Take the common vector space of both texts (already built when we computed the similarity between the whole texts), construct the vectors, and instead of the raw occurrence count put a tf-idf weight into each component, with idf = ln(total number of sentences / number of sentences containing the word). The results are better: not dramatically, but noticeably.
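A sketch of that sentence-level weighting. It uses the textbook formula idf = ln(N/df), which I assume matches what is described above, and the cosine() helper from the previous sketch can be reused on the resulting vectors.

```perl
use strict;
use warnings;

# Given all sentences of both texts (each sentence as an arrayref of stems),
# compute idf = ln(N / df) for every stem
sub idf_table {
    my (@sentences) = @_;
    my %df;
    for my $s (@sentences) {
        my %seen;
        $df{$_}++ for grep { !$seen{$_}++ } @$s;
    }
    my $n = @sentences;
    return { map { $_ => log($n / $df{$_}) } keys %df };
}

# Sparse tf-idf vector for one sentence, comparable via cosine()
sub tfidf_vector {
    my ($stems, $idf) = @_;
    my %tf;
    $tf{$_}++ for @$stems;
    return { map { $_ => $tf{$_} * ($idf->{$_} // 0) } keys %tf };
}
```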

If the similarity between two sentences exceeds a certain threshold, we record the matched sentences in the database, so that the copycat's nose can later be rubbed in them.

Also, sentences of just one word are not compared with anything at all: it is useless, the algorithm simply does not work on such scraps.

If the similarity coefficient is above 0.6, no fortune-teller is needed: it is a paraphrased copy. If it is below 0.4, the similarity is accidental, or there is none at all. But in between lies a grey area: it may be plagiarism, or it may be mere coincidence, where in a human reader's opinion the texts have nothing in common.

Then another algorithm comes into play, which I learned from a good article (Yuhua Li, Zuhair Bandar, David McLean and James O'Shea, "A Method for Measuring Sentence Similarity and its Application to Conversational Agents"). This is heavy artillery: linguistic features. The algorithm takes into account irregular inflected forms, relationships between words such as synonymy and hypernymy, and uncommon words. All of this requires the relevant information in machine-readable form. Fortunately, good people at Princeton University have long maintained a special lexical database for the English language called WordNet. CPAN has a ready-made module for reading it. The only thing I did was move the data from the text files in which Princeton keeps it into MySQL tables, and rewrite the module accordingly. Reading from a pile of text files offers neither comfort nor speed, and storing references as offsets into a file is not what you would call elegant.


Second version


[Image: Duck with ducklings]
Hmm... the second. And where is the first? Well, there is not much to say about the first. It took a text and carried out all the steps of the algorithm one after another, then output the results. Accordingly, it could not do anything in parallel and was slow.

So all the work after the first version was aimed at one and the same thing: faster, faster, faster.

Since most of the working time was spent fetching links and pulling information from the Internet, network access was the first candidate for optimization. Sequential fetching was replaced by parallel downloading (LWP gave way to asynchronous Curl). Speed, of course, grew fantastically. The joy was not spoiled even by glitches in the module, which would take 100 requests, perform 99 of them and hang indefinitely on the last one.

The general architecture of the new system was modeled on an operating system. There is a control module that launches child processes, allotting each a "quantum" of time (5 minutes). During that time a process has to read from the database what it was doing last time, do the next step, write the information needed to continue back to the database, and finish. Any operation fits into 5 minutes except downloading and comparing the found links, so those actions were split into chunks of 100 or 200 links at a time. After five minutes the controller interrupts execution regardless. Didn't finish? You'll get another try next time.
However, a worker process also has to keep an eye on its own execution time, because there is always the risk of running into some website that hangs everything (for example, one such site consisted of a list of 100,000 English words and nothing else; obviously the algorithm described above would spend three days looking for similarities in it, and might never finish).
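A worker can guard its own quantum with a plain alarm. A minimal sketch of that watchdog pattern; the 5-minute constant comes from the text, while the step callback is a stand-in for the real per-step work.

```perl
use strict;
use warnings;

use constant QUANTUM => 5 * 60;   # seconds in one time slice

sub run_step_with_timeout {
    my ($step) = @_;              # $step is a code ref doing one unit of work
    my $ok = eval {
        local $SIG{ALRM} = sub { die "quantum expired\n" };
        alarm QUANTUM;
        $step->();                # e.g. download and compare the next 100 links
        alarm 0;                  # cancel the alarm if we finished in time
        1;
    };
    alarm 0;
    if (!$ok) {
        die $@ unless $@ eq "quantum expired\n";
        warn "ran out of time, will continue on the next quantum\n";
    }
    return $ok;
}
```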

The number of worker processes could be changed, in theory even dynamically. In practice three processes turned out to be optimal.

What else was there... a MySQL database storing the texts to be processed and the intermediate data, along with the final results, and a web interface where users could see what was currently being processed and at what stage it was.

Each task gets a priority, so that more important ones are handled sooner. The priority was computed as a function of file size (the bigger the file, the slower it is processed) and deadline (the closer it is, the sooner results are needed). The manager picked the next task by highest priority, but with a small random perturbation; otherwise low-priority tasks would never get their turn while higher-priority ones kept arriving.
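The article does not give the actual priority formula, so the weights and the size of the random perturbation below are purely illustrative; the sketch only shows the general shape: smaller files and closer deadlines score higher, with some jitter so that low-priority tasks eventually get picked.

```perl
use strict;
use warnings;

# Illustrative priority: the weights and the jitter size are made up for the example
sub priority {
    my ($size_bytes, $hours_to_deadline) = @_;
    my $p = 1_000_000 / ($size_bytes + 1)       # smaller files first
          + 100 / ($hours_to_deadline + 1);     # closer deadlines first
    return $p * (0.8 + 0.4 * rand);             # roughly +-20% random perturbation
}

# Pick the task with the highest perturbed priority
sub pick_task {
    my (@tasks) = @_;    # each task: { size => ..., hours_left => ... }
    my $best;
    for my $task (@tasks) {
        my $p = priority($task->{size}, $task->{hours_left});
        $best = { task => $task, p => $p } if !$best || $p > $best->{p};
    }
    return $best && $best->{task};
}
```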


Third version


[Image: Ya krevedko!]
The third version was a product of evolution in terms of the processing algorithms and a revolution in architecture. I remember standing around in the cold before a date that fell through, waiting for Godot, and recalling a recently read story about Amazon's services. They store files, they run virtual machines, they even have all sorts of strange three-letter services. And then it hit me. I remembered the giant shrimp I had once seen at the Sevastopol aquarium. It stands among the rocks, waves its little legs and filters the water. All sorts of tasty morsels drift past; it picks them out and spits the water onward. Put a lot of shrimp in a row, and they will filter everything in twenty minutes. And if you gather crustaceans of different kinds, each catching its own food, just think what prospects open up.

Translating from figurative language into technical: Amazon has a queue service, SQS, a kind of endless conveyor belt carrying data. You write several programs, each performing only one action, with no context switches, spawning of child processes, or other overhead. "From morning till night the tap fills the same bucket with water; the gas stove heats the same pots, kettles and pans."

The implementation came out simple and beautiful. Each step of the algorithm described above is a separate program. Each has its own queue. The queues carry XML messages saying what to do and how. There is also a management queue and a separate manager program that keeps order, updates the progress data, and notifies the user about any problems. An individual program can send its response to the dispatcher or drop it straight into the next queue, whichever is more convenient. If an error occurs, a message goes to the manager, and it sorts things out.

Error recovery comes for free. If a program fails to cope with a task and, say, crashes, it is restarted, while the unfinished task stays in the queue and will surface again after a while. Nothing is lost.

The only catch is that the Amazon queue service guarantees each message is delivered at least once. That is, it will definitely be delivered, but not necessarily exactly once. You have to be prepared for this and write the processes so that they react properly to duplicates: either detect them (which is not very convenient, since you have to keep some kind of record) or make the processing idempotent.
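A minimal sketch of the "keep a record" variant; the message shape and the in-memory store are assumptions for illustration, since a real system would keep the record somewhere persistent.

```perl
use strict;
use warnings;

my %already_processed;   # message id -> 1; process-local, for illustration only

sub handle_message {
    my ($msg) = @_;      # assumed shape: { id => ..., body => ... }

    # At-least-once delivery means the same id may arrive twice; skip repeats
    return if $already_processed{ $msg->{id} }++;

    process_body($msg->{body});
}

# Stand-in for the real per-step work (download, compare, etc.)
sub process_body {
    my ($body) = @_;
    print "processing: $body\n";
}
```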

Files downloaded from the Internet are of course not forwarded inside the messages; that would be awkward, and SQS has a size limit. Instead they were dumped onto S3, and only a link was sent in the message. After a task completed, the manager cleaned up this temporary storage.
Intermediate data (for example, how many links we need to fetch and how many are already done) was stored in Amazon Simple Data Storage (SDS), a simple but distributed database. It, too, had limitations that had to be kept in mind; for example, it did not guarantee that updates would be visible instantly.

And finally, the finished results, texts with the plagiarism marked up, I started putting not into MySQL but into CouchDB. Even in the relational database they had been stored non-relationally anyway: in text fields in Data::Dumper format (Perl's analogue of JSON). CouchDB was good in every way, like the Queen of Sheba, but it had one drawback, and a fatal one. You cannot throw an arbitrary query at it: for every query an index has to be built in advance, which means you have to predict your queries ahead of time. If your crystal ball fails, you have to start an indexing run, and for large databases it takes several hours (!) during which no other queries are served. Nowadays I would use MongoDB, which has background indexing.

The resulting scheme has a huge advantage over the old one: it scales naturally. Indeed, it has no local data, everything is distributed (except the results), and all worker instances are alike. They can be grouped by weight: run the light, low-resource ones together on a single machine, and give the heavy ones, like the text-comparison process, their own virtual server. Not enough? Can't keep up? Get another one. Some other process can't cope? Put it on a separate machine too. In principle this could even be done automatically: see that one of the queues has too many unprocessed messages, and spin up another EC2 instance.

However, harsh Auntie Life, as usual, made her own corrections to this idyll. Technically the architecture was fine, but economically it turned out that using SDS (and S3) was very unprofitable. It costs a lot, especially the database.

I had to hurriedly move the intermediate data back into good old MySQL and put the downloaded documents on a hard drive shared over NFS. And, at the same time, forget about seamless scaling.


Unrealized plans


[Image: Glory to labor and science]
Studying natural language processing, in particular Manning's exhaustive book, I could not shake the thought that all the methods described there are just tricks, ad hoc hacks for specific tasks that do not generalize. Back in 2001 Lem lamented that in forty years computer science had still not produced artificial intelligence, though a great deal had been written on the subject. He then grimly predicted that the situation would not change in the foreseeable future: the machine did not understand meaning, and never would. The philosopher was right.

The same went for plagiarism detection. I was not expecting to conjure up an AI and wait for it to comprehend text the way a human does, I am not that naive, but I toyed with the idea of bolting on syntactic analysis, at least enough to recognize that two sentences mean the same thing and merely stand in different voices (active and passive). However, all the natural-language parsers I found were extremely complicated and probabilistic, produced results in a strange form, and required enormous computational resources. In general, I think that at the current stage of science it is unrealistic.


The human factor


[Image: Observing the moon]
The system was written to run in fully automatic mode, so there was little people could do to mess it up. Besides, I worked in tandem with a very good sysadmin, thanks to whom all the servers were set up excellently and downtime of every sort was kept to a minimum. But there remained the users from support. And the boss, of course.
Both had long been convinced that plagiarism is found not by a computer but by a little man (or even a whole crowd of them) sitting inside the computer. He is almost like a real person; in particular, he understands everything written in coursework on any topic, and he finds plagiarism because he keeps the entire content of the Internet in his head. However, when these little men messed up, the blame, against all logic, somehow fell not on them but on me. In a word: philologists.

With great difficulty I managed to explain that plagiarism is nevertheless sought by a computer, which does not understand what it is doing. After about a year this sank in with the boss; with the rest, it seems, it never did.

Support had another favorite pastime: type a few sentences into Google and happily report that Google found the plagiarism and my system did not. What could I say to that? Explain the distribution of texts, explain that speed and memory savings required compromises, and that every compromise meant some loss of quality? Hopeless. Fortunately, in most such cases it turned out that Google had found the material on some paid site to which my system simply had no access.

Another variation was to report that Turnitin had detected plagiarism our system had missed. Here it was impossible to explain that Turnitin is most likely written by an entire team of qualified professionals with degrees in the relevant field, and that the site itself probably has an intimate relationship with some serious search engine. Again, fortunately, most of the undetected plagiarism came from paid sites or from other students' works, in general from sources unavailable to us.

For a few months I tried to meet the director's requirement of a fixed processing time: no work should take more than an hour to check. I could not get my head around it and lost sleep over it, until one day it dawned on me that I was essentially being asked to invent a perpetual motion machine, one whose power grows with the load. That does not happen in real life, nor in the world of software. Once the requirement was reworded (every work below a certain size, 50 pages, must be checked in under an hour, provided no huge theses are in the queue at the time), things went smoothly. The conditions were tough, but at least feasible in principle.

Sometimes support delighted me in other ways. I find it hard to explain their logic, but occasionally, under heavy load, they would... shove additional copies of the same works into the checking queue. As if, when a hundred cars are stuck in a traffic jam, you should send another hundred onto the road and then things will start moving. I never managed to explain the fallacy to them, and the practice had to be banned by purely administrative means.


A farewell to commenters


[Image: Singing peacock]
My sad experience shows that on Habr there are some…
