Search engine (computing)

In computing, a search engine is an information retrieval software system designed to help find information stored on one or more computer systems. Search engines discover, crawl, transform, and store information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called hits. The most widely used type of search engine is a web search engine, which searches for information on the World Wide Web.

A search engine normally consists of four components, as follows: a search interface, a crawler (also known as a spider or bot), an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines store images, link data and metadata for the document.

How search engines work

Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query. In the case of text search engines, the search query is typically expressed as a set of words that identify the desired concept that one or more documents may contain.[1] There are several styles of search query syntax that vary in strictness. Whereas some text search engines require users to enter two or three words separated by white space, other search engines may enable users to specify entire documents, pictures, sounds, and various forms of natural language. Some search engines apply improvements to search queries to increase the likelihood of providing a quality set of items through a process known as query expansion. Query understanding methods can also be used to map queries onto a standardized query language.
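As a rough illustration of query expansion, the sketch below appends known synonyms to the terms of a query. The `SYNONYMS` table and the `expand_query` function are hypothetical names invented for this example; a production engine would more likely derive expansions from a thesaurus, query logs, or a statistical model.

```python
# Hypothetical synonym table, invented for illustration only.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query: str) -> list[str]:
    """Return the query terms followed by any known synonyms, without duplicates."""
    terms = query.lower().split()
    expanded = list(dict.fromkeys(terms))  # keep first-seen order, drop repeats
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded
```

Calling `expand_query("cheap car rental")` keeps the three original terms first and then adds the synonyms for "cheap" and "car", so documents that use different wording for the same concept can still match.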

Index-based search engine

The list of items that meet the criteria specified by the query is typically sorted, or ranked. Ranking items by relevance (from highest to lowest) reduces the time required to find the desired information. Probabilistic search engines rank items based on measures of similarity (between each item and the query, typically on a scale of 0 to 1, with 1 being most similar) and sometimes on popularity or authority (see Bibliometrics), or they use relevance feedback. Boolean search engines typically return only items that match exactly, without regard to order, although the term boolean search engine may simply refer to the use of boolean-style syntax (the operators AND, OR, NOT, and XOR) in a probabilistic context.
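The contrast between similarity ranking and boolean matching can be sketched as follows. Cosine similarity over raw term counts is used here as one possible similarity measure on a 0 to 1 scale; it is an assumption of this example, not a method the text attributes to any particular engine. The function names and the `DOCS` collection are likewise hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score every document against the query and sort best-first."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine_similarity(q, Counter(text.lower().split())))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def boolean_and(query: str, docs: dict[str, str]) -> set[str]:
    """Boolean retrieval: ids of documents containing every query term, unranked."""
    terms = set(query.lower().split())
    return {
        doc_id for doc_id, text in docs.items()
        if terms <= set(text.lower().split())
    }


# Hypothetical three-document collection.
DOCS = {
    "d1": "web search engine ranking",
    "d2": "cooking pasta at home",
    "d3": "search engine",
}
```

For the query "search engine", `rank` places `d3` ahead of `d1` because `d3` contains nothing but the query terms, while `boolean_and` returns both documents as an unordered set.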

To quickly provide a set of matching items that are sorted according to some criteria, a search engine will typically collect metadata about the group of items under consideration beforehand, through a process referred to as indexing. The index typically requires a smaller amount of computer storage, which is why some search engines store only the indexed information and not the full content of each item, and instead provide a method of navigating to the items from the search engine result page. Alternatively, the search engine may store a copy of each item in a cache so that users can see the state of the item at the time it was indexed, for archive purposes, or to make repetitive processes more efficient.[2]
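A minimal sketch of indexing, assuming whitespace tokenization and an in-memory dictionary: the inverted index below maps each term to the ids of the documents that contain it, so a query can be answered from the index alone without rereading the documents. `build_index` and `lookup` are hypothetical names for this illustration.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def lookup(index: dict[str, set[str]], query: str) -> set[str]:
    """Ids of documents containing all query terms, found via the index alone."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)
```

Note that the index stores only terms and document ids, not the documents themselves, which mirrors the storage trade-off described above.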

Other types of search engines do not store an index. Crawler-type, or spider-type, search engines (a.k.a. real-time search engines) may collect and assess items at the time of the search query, dynamically considering additional items based on the contents of a starting item (known as a seed, or seed URL in the case of an Internet crawler). Meta search engines store neither an index nor a cache and instead simply reuse the index or results of one or more other search engines to provide an aggregated, final set of results.
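How a meta search engine might aggregate results can be sketched as below. The text does not say how results are combined; reciprocal rank fusion is substituted here as one well-known merging technique, and the constant `k = 60`, the `fuse` function, and the example URLs are all assumptions of this sketch.

```python
from collections import defaultdict


def fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists using reciprocal rank fusion.

    Each URL earns 1 / (k + position) from every list it appears in;
    URLs are returned by descending total score, ties broken by URL.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + position)
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked results returned by two underlying engines.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]
merged = fuse([engine_a, engine_b])
```

A URL that both engines rank highly ends up near the top of `merged`, while a URL returned by only one engine is kept but ranked lower.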

Database size, which had been a significant marketing feature through the early 2000s, was displaced by an emphasis on relevancy ranking, the methods by which search engines attempt to sort the best results first. Relevancy ranking first became a major issue c. 1996, when it became apparent that it was impractical to review full lists of results. Consequently, algorithms for relevancy ranking have continuously improved. Google's PageRank method for ordering results has received the most press, but all major search engines continually refine their ranking methodologies with a view toward improving the ordering of results. By 2006, search engine rankings had become important enough that an industry had developed ("search engine optimizers", or "SEO") to help web developers improve their search ranking, and an entire body of case law had developed around matters that affect search engine rankings, such as the use of trademarks in metatags. The sale of search rankings by some search engines has also created controversy among librarians and consumer advocates.[3]

[Figure: Google's "Knowledge Panel", which is how information from the Knowledge Graph is presented to users.]

The search engine experience for users continues to be enhanced. Google's addition of the Google Knowledge Graph has had wider ramifications for the Internet, possibly even limiting the traffic of certain websites, for example Wikipedia. Some argue that by pulling information and presenting it on its own page, Google can negatively affect other sites. However, there have been no major concerns.[4]

Search engine categories

Web search engines

Search engines that are expressly designed for searching web pages, documents, and images were developed to facilitate searching through a large, loosely organized collection of unstructured resources. They are engineered to follow a multi-stage process: crawling the vast stock of pages and documents to extract the salient terms from their contents, indexing those terms in a semi-structured form (such as a database), and finally resolving user queries to return mostly relevant results and links to the crawled documents or pages in the inventory.

Crawl

In the case of a wholly textual search, the first step in classifying web pages is to find an ‘index item’ that might relate expressly to the ‘search term’. In the past, search engines began with a small list of URLs, a so-called seed list, fetched the content, and parsed the links on those pages for relevant information, which subsequently provided new links. The process was highly cyclical and continued until enough pages were found for the searcher's use. These days, a continuous crawl method is employed rather than incidental discovery based on a seed list. The continuous crawl is an extension of the aforementioned discovery method.
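The seed-list discovery process can be sketched as a breadth-first traversal. To keep the example self-contained, the "web" is a hypothetical in-memory dictionary (`LINKS`) rather than real HTTP fetches, and `crawl` takes a `fetch_links` callable that stands in for fetching a page and parsing its links; all of these names are assumptions of the sketch.

```python
from collections import deque
from typing import Callable, Iterable

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://a.example/", "https://d.example/"],
    "https://c.example/": [],
    "https://d.example/": ["https://c.example/"],
}


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], list[str]],
          max_pages: int = 100) -> list[str]:
    """Breadth-first discovery from a seed list; returns URLs in visit order."""
    frontier = deque(seeds)
    seen = set(frontier)  # URLs already queued, so each page is visited once
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl(["https://a.example/"], lambda url: LINKS.get(url, []))
```

The `max_pages` limit plays the role of "until enough pages were found"; a continuous crawler would instead keep the frontier alive and reschedule pages it has already seen.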

Most search engines use sophisticated scheduling algorithms to decide when to revisit a particular page so that the indexed copy stays relevant. These algorithms range from a constant visit interval with higher priority for more frequently changing pages to an adaptive visit interval based on several criteria such as frequency of change, popularity, and overall quality of the site. The speed of the web server hosting the page, as well as resource constraints such as the amount of hardware or bandwidth, also figure in.
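An adaptive visit interval can be illustrated with a deliberately simple policy: halve the interval when the page changed since the last visit, double it otherwise, and clamp the result. This policy, the function name `next_interval`, and the bounds are hypothetical; it considers only frequency of change and ignores the popularity, quality, and resource criteria mentioned above.

```python
def next_interval(current_hours: float, changed: bool,
                  min_hours: float = 1.0, max_hours: float = 720.0) -> float:
    """Halve the revisit interval if the page changed, double it otherwise.

    The result is clamped to [min_hours, max_hours]; both bounds are
    illustrative values, not taken from any real crawler.
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))
```

Under this policy a page that changes on every visit converges toward the minimum interval, while a static page drifts toward the maximum.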

Pages that are discovered by web crawls are often distributed and fed into another computer that creates a map of the resources uncovered. The resulting cluster looks a little like a graph, on which the different pages are represented as small nodes that are connected by the links between the pages. The data is stored in multiple data structures that permit quick access by algorithms that compute a popularity score for pages on the web based on how many links point to a given page. This is how a user can reach any number of resources on a topic such as diagnosing psychosis. Another example is the relative rank of web pages containing information on Mohamed Morsi versus pages on the best attractions to visit in Cairo after simply entering ‘Egypt’ as a search term. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention.
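A link-based popularity score in the spirit of PageRank can be sketched with power iteration over a small link graph. This is a simplified textbook form, not Google's production algorithm: the damping factor of 0.85 is the commonly cited value, the fixed iteration count is arbitrary, and the sketch assumes every link target also appears as a key of the graph.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Approximate PageRank scores by repeated redistribution of rank.

    `links` maps each page to the pages it links to. Assumes every
    link target is itself a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: three pages link to "c", and "c" links to "a".
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

In this graph "c" receives the highest score because three pages link to it, and "a" comes second because its only inbound link is from the highly ranked "c".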

The idea of doing link analysis to compute a popularity rank is older than PageRank. In October 2014, Google's John Mueller confirmed that Google was not going to update it (PageRank) going forward. Other variants of the same idea are currently in use. These ideas can be grouped into categories such as the rank of individual pages and the nature of web site content. Search engines often differentiate between internal links and external links, because web content creators frequently link to their own pages for self-promotion. Link map data structures typically store the anchor text embedded in the links as well, because anchor text can often provide a good-quality summary of a web page's content.
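One way a link map entry might record anchor text and distinguish internal from external links is sketched below. The record layout and the `classify_links` function are hypothetical; the sketch assumes absolute URLs and treats a link as internal only when its hostname exactly equals the page's hostname, which is a simplification (subdomains, for instance, count as external here).

```python
from urllib.parse import urlparse


def classify_links(page_url: str, links: list[tuple[str, str]]) -> list[dict]:
    """Build link-map records from (href, anchor text) pairs found on a page."""
    page_host = urlparse(page_url).hostname
    records = []
    for href, anchor_text in links:
        records.append({
            "target": href,
            "anchor": anchor_text.strip(),
            # Internal means same hostname as the page carrying the link.
            "internal": urlparse(href).hostname == page_host,
        })
    return records


records = classify_links(
    "https://example.org/about",
    [
        ("https://example.org/contact", " Contact us "),
        ("https://other.example/guide", "Beginner's guide to search"),
    ],
)
```

A ranking step could then weigh the external record more heavily than the internal one and use its anchor text as a short description of the target page.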

Database search engines

Searching for text-based content in databases presents a few special challenges, which a number of specialized search engines address. Databases can be slow when solving complex queries (with multiple logical or string matching arguments). Databases allow pseudo-logical queries which full-text searches do not use. There is no crawling necessary for a database since the data is already structured. However, it is often necessary to index the data in a more compact form to allow faster search.

Mixed search engines

Sometimes, the data searched contains both database content and web pages or documents. Search engine technology has developed to respond to both sets of requirements. Most mixed search engines are large Web search engines, like Google. They search through both structured and unstructured data sources. Take, for example, the word ‘ball’. In its simplest terms, it returns more than 40 variations on Wikipedia alone. Did you mean a ball, as in the social gathering or dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index. Databases are also indexed from various sources. Search results are then generated for users by querying these multiple indices in parallel and combining the results according to “rules”.
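Querying several indices in parallel and combining the results can be sketched with a thread pool. The two indices, their contents, and the combining rule (database hits first, then page hits) are all hypothetical; the text does not specify what the actual rules are.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical indices: one built from crawled pages, one from database rows.
PAGE_INDEX = {"ball": ["Ball (dance party)", "Ball (association football)"]}
DATABASE_INDEX = {"ball": ["Product #17: tennis ball"]}


def search_index(index: dict[str, list[str]], source: str,
                 query: str) -> list[tuple[str, str]]:
    """Look the query up in one index and tag each hit with its source."""
    return [(source, hit) for hit in index.get(query.lower(), [])]


def mixed_search(query: str) -> list[tuple[str, str]]:
    """Query both indices in parallel, then combine with a simple rule."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pages = pool.submit(search_index, PAGE_INDEX, "page", query)
        rows = pool.submit(search_index, DATABASE_INDEX, "database", query)
        # Illustrative combining rule: database hits first, then page hits.
        return rows.result() + pages.result()
```

Tagging each hit with its source keeps the combining step free to apply whatever rule the engine prefers, such as interleaving sources or boosting one of them.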

History of search technology

The Memex

The concept of hypertext and a memory extension originates from an article written by Vannevar Bush, titled "As We May Think", that was published in The Atlantic Monthly in July 1945. In this article, Bush urged scientists to work together to help build a body of knowledge for all mankind. He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex.[5]

Bush regarded the notion of “associative indexing” as his key conceptual contribution. As he explained, this was “a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.”[6]

All of the documents used in the memex would be in the form of microfilm copy acquired as such or, in the case of personal records, transformed to microfilm by the machine itself. Memex would also employ new retrieval techniques based on a new kind of associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another, to create personal "trails" through linked documents. The new procedures that Bush anticipated would facilitate information storage and retrieval and lead to the development of wholly new forms of the encyclopedia.

The most important mechanism, conceived by Bush, is the associative trail. It would be a way to create a new linear sequence of microfilm frames across any arbitrary sequence of microfilm frames by creating a chained sequence of links in the way just described, along with personal comments and side trails.

In 1965, Bush took part in Project INTREX at MIT, which developed technology for mechanizing the processing of information for library use. In his 1967 essay titled "Memex Revisited", he pointed out that the development of the digital computer, the transistor, the video, and other similar devices had heightened the feasibility of such mechanization, but that costs would delay its achievement.[7]

SMART

Gerard Salton, who died on August 28, 1995, is regarded as the father of modern search technology. His teams at Harvard and Cornell developed the SMART information retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms.

He authored a 56-page book called A Theory of Indexing which explained many of his tests, upon which search is still largely based.
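The term frequency and inverse document frequency concepts associated with SMART can be illustrated with a small weighting function. The formula below, relative term frequency multiplied by log(N / document frequency), is one common TF-IDF variant chosen for this sketch; it is not claimed to be the exact weighting SMART used, and `tf_idf` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Weight each term by its relative frequency in the document times
    log(N / number of documents containing the term)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )
    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

A term that appears in every document gets a weight of zero, while a term that is rare across the collection is weighted more heavily. The resulting per-document weight dictionaries are the vectors of the vector space model.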

String search engines

In 1987, an article was published detailing the development of a character string search engine (SSE) for rapid text retrieval on a double-metal 1.6-μm n-well CMOS solid-state circuit with 217,600 transistors laid out on an 8.62x12.76-mm die area. The SSE accommodated a novel string-search architecture which combines a 512-stage finite-state automaton (FSA) logic with a content addressable memory (CAM) to achieve an approximate string comparison of 80 million strings per second. The CAM cell consisted of four conventional static RAM (SRAM) cells and a read/write circuit. Concurrent comparison of 64 stored strings with variable length was achieved in 50 ns for an input text stream of 10 million characters/s, permitting performance despite the presence of single character errors in the form of character codes. Furthermore, the chip allowed nonanchor string search and variable-length 'don't care' (VLDC) string search.[8]
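What a nonanchor, variable-length 'don't care' search computes can be shown in software, although this says nothing about how the chip implemented it in hardware. In the sketch below, the `*` wildcard notation and the `vldc_search` function are assumptions made for illustration, and the chip's tolerance of single character errors is not modeled.

```python
def vldc_search(text: str, pattern: str, wildcard: str = "*") -> bool:
    """True if the pattern occurs anywhere in the text.

    Each wildcard in the pattern matches a run of zero or more arbitrary
    characters. The fixed pieces between wildcards are located left to
    right, each at its earliest position after the previous piece.
    """
    position = 0
    for piece in pattern.split(wildcard):
        found = text.find(piece, position)
        if found == -1:
            return False
        position = found + len(piece)
    return True
```

For example, `vldc_search("the search engine", "sea*eng")` succeeds because "sea" is followed later by "eng", whereas the pattern `"eng*sea"` fails because the pieces occur in the wrong order.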

References

  1. Voorhees, E.M. Natural Language Processing and Information Retrieval. National Institute of Standards and Technology. March 2000.
  2. "Internet Basics: Using Search Engines". GCFGlobal.org. Retrieved 2022-07-11.
  3. Stross, Randall (22 September 2009). Planet Google: One Company's Audacious Plan to Organize Everything We Know. Simon and Schuster. ISBN 978-1-4165-4696-2. Retrieved 9 December 2012.
  4. "What do we make of Wikipedia's falling traffic?". The Daily Dot. 2014-01-08. Retrieved 2020-11-01.
  5. Yeo, Richard (30 January 2007). "Before Memex: Robert Hooke, John Locke, and Vannevar Bush on External Memory". Science in Context. 20 (1): 21. doi:10.1017/S0269889706001128. hdl:10072/15207. S2CID 2378301.
  6. Yeo, Richard (30 January 2007). "Before Memex: Robert Hooke, John Locke, and Vannevar Bush on External Memory". Science in Context. 20 (1): 21–47. doi:10.1017/S0269889706001128. hdl:10072/15207. S2CID 2378301. The example Bush gives is a quest to find information on the relative merits of the Turkish short bow and the English long bow in the crusades.
  7. "The MEMEX of Vannevar Bush". 4 January 2021. Archived from the original on 7 January 2021. Retrieved 12 August 2023.
  8. Yamada, H.; Hirata, M.; Nagai, H.; Takahashi, K. (Oct 1987). "A high-speed string-search engine". IEEE Journal of Solid-State Circuits. 22 (5). IEEE: 829–834. Bibcode:1987IJSSC..22..829Y. doi:10.1109/JSSC.1987.1052819.
