Report from the Lucene Revolution conference

In early October I was able to attend Lucene Revolution, which was held in the hero city of Boston. The conference was dedicated to the open source search technologies Apache Lucene and Apache Solr. It seems to me that these technologies get undeservedly little attention, on Habr in particular and on the Internet in general. Let's fix that omission.



I have worked with these technologies for a long time, but I could not have imagined that a conference on such a narrow, specialized area could gather so many speakers, attendees, and companies: more than 300 participants in total and more than 40 speakers. The range of places where these search technologies are deployed is quite impressive: social networks, microblogging services, CRM platforms, online shops, government websites, online and university libraries, dating sites, and even bioinformatics projects.

LinkedIn

The largest professional social network holds more than 80 million user profiles and serves approximately 350 million search queries per week.

In my opinion LinkedIn has the most advanced search of all the social networks, for two reasons:

  1. Using the social graph in search
    That is, the ordering of search results depends on the distance between users in the graph of social ties. And it makes sense: a user is more likely to be interested in a John Doe from his own circle of acquaintances than in the one whose ID in the system is lower or whose profile is more complete. The user can also filter the search results by a predetermined distance: first circle or second circle.
  2. Faceted navigation
    Long gone are the days when a flat list of search results was the cutting edge of the search interface. Online stores, content management systems, and specialized search engines have long enriched their results with special filters that let the user narrow the search and eventually get to the information. A key feature of this UI is that the user sees in advance how many results a particular filter would return. LinkedIn provides a rich set of filters by employer, location, university, and so on.
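
Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the connection graph, the profile records, and the field names (`employer`, `location`) are made up, and nothing here reflects LinkedIn's actual implementation or any Lucene/Solr API. Results are ordered by the number of hops from the searcher, and facet counts are computed over the result set.

```python
from collections import Counter, deque

# Hypothetical data; user ids and field names are illustrative assumptions.
connections = {
    "me": ["ann", "bob"],
    "ann": ["me", "cid"],
    "bob": ["me"],
    "cid": ["ann"],
    "dan": [],
}
profiles = [
    {"id": "dan", "employer": "Acme", "location": "Boston"},
    {"id": "cid", "employer": "Initech", "location": "Boston"},
    {"id": "bob", "employer": "Acme", "location": "Austin"},
]

def graph_distances(graph, start):
    """Breadth-first search: hops from start to every reachable user."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for friend in graph.get(user, []):
            if friend not in dist:
                dist[friend] = dist[user] + 1
                queue.append(friend)
    return dist

def rank_by_distance(docs, dist):
    """Closer users first; users outside the searcher's network go last."""
    return sorted(docs, key=lambda doc: dist.get(doc["id"], float("inf")))

def facet_counts(docs, fields):
    """For each field, count how many results carry each value."""
    return {field: Counter(doc[field] for doc in docs if field in doc)
            for field in fields}

dist = graph_distances(connections, "me")
ranked = rank_by_distance(profiles, dist)
counts = facet_counts(ranked, ["employer", "location"])
```

Here `counts["location"]` tells the interface how many results each location filter would leave before the user clicks it, which is the "sees in advance" property described above.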

At the moment LinkedIn's engineers are mostly occupied with the scalability and performance of their solutions and with instant updates to the search index, and that is what their presentation was devoted to.

Twitter, a billion queries a day

Every day about 80 million new tweets are added to Twitter's database, and the delay before new data appears in the search index must be no more than a few seconds. The previous version of the search system was built on MySQL, using technology from Summize. The new version uses a slightly modified Apache Lucene library. Twitter plans to open up its work in the near future and contribute it to the Apache project.

Salesforce

The largest company in SaaS CRM also actively uses Lucene. At the moment Salesforce's index takes up more than 8 terabytes (20 billion documents). Their search is used by about 500 thousand users, and the average load is around 4000 requests per second.

Loggly.com

Loggly is a startup that offers a cloud solution for storing and analyzing your logs. You can configure your servers to send logs to the company's syslog server and later search those logs and use all sorts of analytics. The core of the architecture is the Solr Cloud platform, which makes it possible to index up to 100 thousand messages per second.

Archive.org

Archive-it.org, a commercial service from the well-known archive.org, provides full-text search over nearly 1 billion documents for more than 120 customers and has migrated to a solution based on Apache Solr. For crawling they use a mix of their own Heritrix and a customized Apache Nutch.

Search.USA.gov and WhiteHouse.gov

The U.S. government's IT budget is about $75 billion, so it is not surprising that government contracts are very valuable to technology companies. Search on these government websites has come a long way over the past ten years, through the commercial solutions Inktomi (2000), Fast (2002), and Vivisimo (2005), and eventually found stability in the open source Apache Lucene/Solr. Another interesting fact is that the developers use Rackspace Hosting and Amazon Web Services for hosting, Pivotal Tracker for project management, and GitHub to store source code. Many commercial corporations try to keep all of these things under complete control inside their own intranet, so it is quite surprising to see such openness in the development of a government solution.

Libraries and institutions (HathiTrust, Yale, Smithsonian)

Various university libraries use Apache Solr both for catalog search and for full-text search over scanned and recognized books. The main challenges they face are supporting different languages (CJK, compound words, etc.) and scaling massively at a reasonable price. HathiTrust's search indexes 6.5 million scanned books (244 terabytes of images, 6 terabytes of recognized text) with limited computational resources.

Bioinformatics (the Metarep project)

This is perhaps the most surprising application of these technologies, fundamentally different from full-text search. Frankly, I understood little of it, except that this field also faces the problem of processing and analyzing large volumes of information.

What's new in Apache Lucene and Solr

There were also several presentations about new features coming to Apache Lucene and Apache Solr in the next release, such as:

Solr Cloud

This functionality greatly simplifies the creation, configuration, and maintenance of a distributed cluster. At the core of the solution is the Apache ZooKeeper project, which has proven its reliability in projects such as HBase and in a number of Yahoo solutions.

Geographic search (Solr Spatial Search)

It is no secret that all kinds of location-based services have recently gained popularity. They often need to find objects not only by distance from the user but also with the usual filters applied: full text, category, tag, and so on. Apache Solr now provides this capability out of the box (it was possible before, but everyone reinvented the wheel). Lucene/Solr is used by such major industry players as Yelp.com and YP.com.
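
The combination of a distance constraint with an ordinary filter can be illustrated with a small Python sketch. It is an assumption-laden toy, not Solr's spatial API: the place records, their names, and the `nearby` helper are hypothetical, and a linear scan with the haversine formula stands in for whatever spatial index a real engine would use.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical places; names and categories are made up for illustration.
places = [
    {"name": "Cafe A", "category": "cafe", "lat": 42.3601, "lon": -71.0589},
    {"name": "Cafe B", "category": "cafe", "lat": 42.3736, "lon": -71.1097},
    {"name": "Bar C", "category": "bar", "lat": 42.3611, "lon": -71.0570},
    {"name": "Cafe D", "category": "cafe", "lat": 40.7128, "lon": -74.0060},
]

def nearby(docs, lat, lon, radius_km, category):
    """Apply the category filter, then keep places within the radius, nearest first."""
    hits = []
    for doc in docs:
        if doc["category"] != category:
            continue
        distance = haversine_km(lat, lon, doc["lat"], doc["lon"])
        if distance <= radius_km:
            hits.append((distance, doc["name"]))
    return [name for _, name in sorted(hits)]

found = nearby(places, 42.3601, -71.0589, 10.0, "cafe")
```

With a searcher in central Boston and a 10 km radius, the bar is dropped by the category filter and the far-away cafe by the distance filter.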

Instant search (Realtime search)

The index data structure and the search algorithm are such that instantly updating the index with new documents (every second or even every fraction of a second) was and remains very difficult. This is due to the in-memory caches, the merging of index segments, the flushing of index segments from memory to disk, and so on. Some companies (Twitter, LinkedIn, and others) are working on this and have made good progress in reducing the delay to a minimum.
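
To make the moving parts concrete, here is a toy Python model of a segmented inverted index. It is a sketch under assumptions, not how Lucene, Twitter, or LinkedIn implement it: the `ToyIndex` class, its thresholds, and the idea of searching the write buffer directly are all illustrative. New documents land in an in-memory buffer, the buffer is periodically frozen into an immutable segment (standing in for a flush to disk), and segments are merged once too many accumulate.

```python
from collections import defaultdict

class ToyIndex:
    """Inverted index split into an in-memory buffer and immutable segments."""

    def __init__(self, flush_threshold=2, merge_threshold=3):
        self.buffer = defaultdict(set)   # term -> doc ids not yet flushed
        self.buffered_docs = 0
        self.segments = []               # each segment: term -> frozenset of doc ids
        self.flush_threshold = flush_threshold
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.buffer[term].add(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Freeze the buffer into a new segment (stands in for a disk write)."""
        if self.buffer:
            self.segments.append(
                {term: frozenset(ids) for term, ids in self.buffer.items()})
        self.buffer = defaultdict(set)
        self.buffered_docs = 0
        if len(self.segments) >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Combine all segments into one so searches touch fewer structures."""
        merged = defaultdict(set)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term] |= ids
        self.segments = [{term: frozenset(ids) for term, ids in merged.items()}]

    def search(self, term):
        """Look in the buffer and in every segment, so fresh docs are visible."""
        term = term.lower()
        hits = set(self.buffer.get(term, ()))
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return sorted(hits)

index = ToyIndex()
index.add(1, "Lucene Revolution")
fresh = index.search("lucene")   # found while the document is still only in the buffer
for doc_id in range(2, 8):
    index.add(doc_id, "Solr and Lucene")
```

The delay discussed above corresponds to the gap between `add` and the moment a searcher can see the document. The toy closes that gap by searching the buffer directly, and every flush and merge is extra work competing with queries.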

A flexible indexing mechanism in Lucene (Flexible Indexing)

At the moment the Lucene library is being substantially rewritten to give developers the ability to control how the index is written to disk.
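
What "controlling how the index is written" could mean is easiest to see on a posting list, the sorted list of document ids for one term. The Python sketch below is purely illustrative and does not use or mirror Lucene's API: the two codec classes and `write_postings` are hypothetical names. It shows the same postings serialized by two interchangeable encoders, a fixed-width one and one that stores gaps between ids as variable-length bytes.

```python
import struct

class FixedWidthCodec:
    """Stores every doc id as a 4-byte big-endian unsigned integer."""

    def encode(self, doc_ids):
        return b"".join(struct.pack(">I", doc_id) for doc_id in doc_ids)

    def decode(self, data):
        return [struct.unpack_from(">I", data, offset)[0]
                for offset in range(0, len(data), 4)]

class DeltaVarintCodec:
    """Stores gaps between sorted doc ids, 7 bits per byte."""

    def encode(self, doc_ids):
        out = bytearray()
        previous = 0
        for doc_id in doc_ids:
            gap = doc_id - previous
            previous = doc_id
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)  # low 7 bits plus continuation flag
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode(self, data):
        doc_ids, current, gap, shift = [], 0, 0, 0
        for byte in data:
            gap |= (byte & 0x7F) << shift
            if byte & 0x80:          # more bytes belong to this gap
                shift += 7
            else:                    # gap complete: rebuild the absolute doc id
                current += gap
                doc_ids.append(current)
                gap, shift = 0, 0
        return doc_ids

def write_postings(doc_ids, codec):
    """The indexing code stays the same; only the pluggable codec changes."""
    return codec.encode(sorted(doc_ids))

postings = [3, 7, 8, 300, 100000]
fixed = write_postings(postings, FixedWidthCodec())
compact = write_postings(postings, DeltaVarintCodec())
```

For these five ids the fixed-width form takes 20 bytes and the gap-encoded form 8, and both decode back to the same list. A pluggable on-disk format lets a developer make that kind of trade-off.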

There were also sessions dedicated to the following topics and related technologies:
  • using Apache Solr as a NoSQL store
  • integrating the Apache Nutch crawler
  • migrating from the Fast enterprise search platform to Apache Solr
  • and many more


The archive of conference presentations is here
Article based on information from habrahabr.ru
