[{"categories":["Tutorials"],"contents":" What is vector search The traditional lexical search works very well with structured data but what happens when we are dealing with unstructured data like images, video, raw text, etc? Vector search tries to address the limitations of the lexical search by providing the ability to query unstructured data. While the lexical search tries to match the literals of the words or their variants, vector search attempts to search based on the proximity of the data and query points in a multi-dimensional vector space. Semantic search uses vector search to achieve its ultimate goal - to focus on the intent or the meaning of data.\nSemantic search is achieved with deep learning and vector search. With neural network models, the unstructured data can be represented as a sequence of floating point values known as embedding vectors. These vector representations are then indexed (using Lucene API for example). Embedding vectors that are close to one another represent semantically similar pieces of data.\nVector search finds similar data using approximate nearest neighbor (ANN) algorithms. One such algorithm is the Hierarchical Navigable Small World Graphs (HNSW) and Lucene has its implementation for this algorithm. Query vectors that are produced using the same neural network models are then used to identify semantically similar pieces of data. Several corporations have started to leverage semantic search in solving interesting challenges and use cases like Spotify enabling users to find more relevant content using semantic search etc.\nA Hands-on with Vector Search and Lucene For this hands-on example, we have leveraged OpenAI\u0026rsquo;s Wikipedia embeddings dataset (25k documents). This dataset includes an embedded vector representation of the title and content fields. The query vector has been generated using OpenAI\u0026rsquo;s embeddings endpoint. The vector fields in the dataset and query vector have 1536 dimensions. There are mainly three sections of the code: Setup, Indexing of data, and Querying.\nSetup We need a place to keep the index files and for that, we used ByteBuffersDirectory to conveniently keep the index files in the heap memory. Alternatively, if we want to store the index files in the file system we can use FSDirectory.\nDirectory index = new ByteBuffersDirectory(); Addressing Lucene\u0026rsquo;s limitation of a maximum of 1024 dimensions for vector fields While many models are less than 1024 dimensions we are using an OpenAI based model which is of 1536 dimensions. Lucene by default has a limitation of a maximum of 1024 dimensions for vector fields and there have been discussions in the Lucene community on increasing this limit. Since our dataset and query vector is of 1536 dimensions we had to create a workaround to make this work. 
Our workaround involved setting a Lucene95Codec codec with an overridden getKnnVectorsFormatForField method to enable indexing of vectors with 1536 dimensions using HighDimensionKnnVectorsFormat.\nLucene95Codec knnVectorsCodec = new Lucene95Codec(Mode.BEST_SPEED) { @Override public KnnVectorsFormat getKnnVectorsFormatForField(String field) { int maxConn = 16; int beamWidth = 100; KnnVectorsFormat knnFormat = new Lucene95HnswVectorsFormat(maxConn, beamWidth); return new HighDimensionKnnVectorsFormat(knnFormat, 1536); } }; IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()).setCodec(knnVectorsCodec); private static class HighDimensionKnnVectorsFormat extends KnnVectorsFormat { private final KnnVectorsFormat knnFormat; private final int maxDimensions; public HighDimensionKnnVectorsFormat(KnnVectorsFormat knnFormat, int maxDimensions) { super(knnFormat.getName()); this.knnFormat = knnFormat; this.maxDimensions = maxDimensions; } @Override public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException { return knnFormat.fieldsWriter(state); } @Override public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException { return knnFormat.fieldsReader(state); } @Override public int getMaxDimensions(String fieldName) { return maxDimensions; } } Indexing data All we do here is go through the documents in the dataset archive and add the documents to the index using an instance of the IndexWriter. KnnFloatVectorField is used to index title_vector and content_vector fields with cosine as a similarity function. These fields are the embedded vector representation of the actual title and content fields.\ntry (ZipFile zip = new ZipFile(\u0026quot;vector_database_wikipedia_articles_embedded.zip\u0026quot;); IndexWriter writer = new IndexWriter(index, config)) { CSVReader reader = new CSVReader(new InputStreamReader(zip.getInputStream(zip.entries().nextElement()))); String[] line; int count = 0; while ((line = reader.readNext()) != null) { if ((count++) == 0) continue; // skip the first line of the file, it is a header Document doc = new Document(); doc.add(new StringField(\u0026quot;id\u0026quot;, line[0], Field.Store.YES)); doc.add(new StringField(\u0026quot;url\u0026quot;, line[1], Field.Store.YES)); doc.add(new StringField(\u0026quot;title\u0026quot;, line[2], Field.Store.YES)); doc.add(new TextField(\u0026quot;text\u0026quot;, line[3], Field.Store.YES)); float[] titleVector = ArrayUtils.toPrimitive(Arrays.stream(line[4].replace(\u0026quot;[\u0026quot;, \u0026quot;\u0026quot;).replace(\u0026quot;]\u0026quot;, \u0026quot;\u0026quot;). split(\u0026quot;, \u0026quot;)).map(Float::valueOf).toArray(Float[]::new)); doc.add(new KnnFloatVectorField(\u0026quot;title_vector\u0026quot;, titleVector, VectorSimilarityFunction.COSINE)); float[] contentVector = ArrayUtils.toPrimitive(Arrays.stream(line[5].replace(\u0026quot;[\u0026quot;, \u0026quot;\u0026quot;).replace(\u0026quot;]\u0026quot;, \u0026quot;\u0026quot;). split(\u0026quot;, \u0026quot;)).map(Float::valueOf).toArray(Float[]::new)); doc.add(new KnnFloatVectorField(\u0026quot;content_vector\u0026quot;, contentVector, VectorSimilarityFunction.COSINE)); doc.add(new StringField(\u0026quot;vector_id\u0026quot;, line[6], Field.Store.YES)); if (count % 1000 == 0) System.out.println(count + \u0026quot; docs indexed ...\u0026quot;); writer.addDocument(doc); } writer.commit(); } catch (Exception e) { e.printStackTrace(); } Query Finally, we query the Lucene index. 
The query is Is the Atlantic the biggest ocean in the world?. We\u0026rsquo;ve borrowed the step mentioned in OpenAI\u0026rsquo;s cookbook to encode this query with OpenAI\u0026rsquo;s embedding model to generate the query vector. The query vector stored in the query.txt file is then used to query the content_vector field of the index by passing an instance of the KnnFloatVectorQuery to the IndexSearcher\u0026rsquo;s search method.\nIndexReader reader = DirectoryReader.open(index); IndexSearcher searcher = new IndexSearcher(reader); for (String line: FileUtils.readFileToString(new File(\u0026quot;query.txt\u0026quot;), \u0026quot;UTF-8\u0026quot;).split(\u0026quot;\\n\u0026quot;)) { float queryVector[] = ArrayUtils.toPrimitive(Arrays.stream(line.replace(\u0026quot;[\u0026quot;, \u0026quot;\u0026quot;).replace(\u0026quot;]\u0026quot;, \u0026quot;\u0026quot;). split(\u0026quot;, \u0026quot;)).map(Float::valueOf).toArray(Float[]::new)); Query query = new KnnFloatVectorQuery(\u0026quot;content_vector\u0026quot;, queryVector, 1); TopDocs topDocs = searcher.search(query, 100); ScoreDoc[] hits = topDocs.scoreDocs; System.out.println(\u0026quot;Found \u0026quot; + hits.length + \u0026quot; hits.\u0026quot;); for (ScoreDoc hit: hits) { Document d = searcher.storedFields().document(hit.doc); System.out.println(d.get(\u0026quot;title\u0026quot;)); System.out.println(d.get(\u0026quot;text\u0026quot;)); System.out.println(\u0026quot;Score: \u0026quot; + hit.score); System.out.println(\u0026quot;-----\u0026quot;); } } Output Below is what the output looks like.\nStarting indexing of data ... 1000 docs indexed ... 2000 docs indexed ... 3000 docs indexed ... 4000 docs indexed ... 5000 docs indexed ... 6000 docs indexed ... 7000 docs indexed ... 8000 docs indexed ... 9000 docs indexed ... 10000 docs indexed ... 11000 docs indexed ... 12000 docs indexed ... 13000 docs indexed ... 14000 docs indexed ... 15000 docs indexed ... 16000 docs indexed ... 17000 docs indexed ... 18000 docs indexed ... 19000 docs indexed ... 20000 docs indexed ... 21000 docs indexed ... 22000 docs indexed ... 23000 docs indexed ... 24000 docs indexed ... 25000 docs indexed ... Running queries ... Found 1 hits. Atlantic Ocean The Atlantic Ocean is the world's second largest ocean. It covers a total area of about . It covers about 20 percent of the Earth's surface. It is named after the god Atlas from Greek mythology. Geologic history The Atlantic formed when the Americas moved west from Eurasia and Africa. This began sometime in the Cretaceous period, roughly 135 million years ago. It was part of the break-up of the supercontinent Pangaea. The east coast of South America is shaped somewhat like the west coast of Africa, and this gave a clue that continents moved over long periods of time (continental drift). The Atlantic Ocean is still growing now, because of sea-floor spreading from the mid-Atlantic Ridge, while the Pacific Ocean is said to be shrinking because the sea floor is folding under itself or subducting into the mantle. Geography The Atlantic Ocean is bounded on the west by North and South America. It connects to the Arctic Ocean through the Denmark Strait, Greenland Sea, Norwegian Sea and Barents Sea. It connects with the Mediterranean Sea through the Strait of Gibraltar. In the southeast, the Atlantic merges into the Indian Ocean. The 20° East meridian defines its border. In the southwest, the Drake Passage connects it to the Pacific Ocean. The Panama Canal links the Atlantic and Pacific. 
The Atlantic Ocean is second in size to the Pacific. It occupies an area of about . The volume of the Atlantic, along with its adjacent seas (the seas next to it), is 354,700,000 cubic kilometres. The average depth of the Atlantic, along with its adjacent seas, is . The greatest depth is Milwaukee Deep near Puerto Rico, where the Ocean is deep. Gulf Stream The Atlantic Ocean has important ocean currents. One of these, called the Gulf Stream, flows across the North Atlantic. Water gets heated by the sun in the Caribbean Sea and then moves northwest toward the North Pole. This makes France, the British Isles, Iceland, and Norway in Europe much warmer in winter than Newfoundland and Nova Scotia in Canada. Without the Gulf Stream, the climates of northeast Canada and northwest Europe might be the same, because these places are about the same distance from the North Pole. There are currents in the South Atlantic too, but the shape of this sea means that it has less effect on South Africa. Geology The main feature of the Atlantic Ocean's seabed is a large underwater mountain chain called the Mid-Atlantic Ridge. It runs from north to south under the Ocean. This is at the boundary of four tectonic plates: Eurasian, North American, South American and African. The ridge extends from Iceland in the north to about 58° south. The salinity of the surface waters of the open ocean ranges from 3337 parts per thousand and varies with latitude and season. References Other websites LA Times special Altered Oceans Oceanography Image of the Day, from the Woods Hole Oceanographic Institution National Oceanic and Atmospheric Administration NOAA In-situ Ocean Data Viewer Plot and download ocean observations www.cartage.org.lb www.mnsu.edu Score: 0.9364375 ----- Try it out yourself! For your convenience, we have put an end-to-end working example of the above out on GitHub for you to explore and play with. We\u0026rsquo;d love to hear from you!\nWhats next As a part of this series of blog posts, some of the next posts will cover our work in speeding up vector search in Lucene with GPUs.\n","permalink":"https://searchscale.com/blog/vector-search-with-lucene/","tags":["Lucene","Vector Search","LLM","KNN","OpenAI"],"title":"Vector Search with Lucene"},{"categories":["Features","SolrCloud"],"contents":" Solr nodes should be like cattle, not pets!\n Every week, somewhere in the world, at least one DevOps engineer responsible for a non-trivially sized Solr cluster thinks like this when they deal with SolrCloud operations such as cluster restarts. The sad reality is that Solr nodes still require careful hand holding (like pets do) during cluster wide changes to ensure zero downtime and stability.\nThe way single replica state changes are handled in the existing SolrCloud design limits the scalability potential of SolrCloud. The design worked fine back in the day when Solr clusters had handful of nodes, say less than 10, and a handful of collections. But, with 1000+ collections and a few tens or hundred nodes today, SolrCloud has some serious operational challenges in maintaining 100% uptime.\nIn Apache Solr 8.8 and 8.8.1, a new solution has been released. 
However, before jumping on to the solution, let us look at how single replica state changes are handled in Solr today.\nCurrent design for replica state updates Every single replica state change starts a cycle of the following operations:\n The replica posts a message into the overseer queue Overseer reads the message Overseer updates the state.json for the collection Overseer deletes the message from the queue Every node in the cluster that hosts the collection gets an event notification (via ZK watchers) about the change in state.json. They fetch from ZK and update their view of the collection. Challenges with current design You’d ask, what\u0026rsquo;s the problem with how Solr handles these changes? Let us look into that and see why this could become a problem:\n The number of events fired increases linearly with the number of replicas in a collection \u0026amp; the total number of collections The size of state.json increases linearly with the number of shards and replicas in a collection The number of Zookeeper reads and the size of data read from ZK increases quadratically with the number of nodes, collections, replicas Since a cluster has a single Overseer that processes the messages from the queue, an increase in the number of nodes, collections and replicas can lead to a slowdown in processing state update messages, ultimately leading to a failure in the cluster. In such a scenario, recovering the failed cluster becomes very hard.\nIf we quickly look at the situation with individual replica state changes, here are the two main problems that affect overall SolrCloud operations:\n Overseer Bottlenecks: Usually, in most production workloads, about 90-95% of the overseer messages are \u0026ldquo;state updates\u0026rdquo;. Other collection API operations (e.g. ADDREPLICA, CREATE, SPLITSHARD etc.) would get slowed down (or time out) due to processing excessive state update messages.\n Instability: Restarting more than a few nodes at a time can lead to cascading instability for the entire cluster due to the generation of excessive state update messages (proportional to the number of replicas hosted on a node and the number of nodes restarted).\n Introducing Per Replica States Apache Solr 8.8 and 8.8.1 include a new solution developed by Noble Paul and Ishan Chattopadhyaya, with support from FullStory.\nInstead of the approach where a single state.json contains the structure of the collection as well as the individual states, the solution follows a \u0026ldquo;Per Replica State\u0026rdquo; approach, as follows:\n Every replica\u0026rsquo;s state is in a separate znode nested under the state.json, with a name that encodes the replica name, state and the leadership status. For nodes watching the states for a collection, a new \u0026ldquo;children watcher\u0026rdquo; (in addition to the data watcher) is set on state.json. Upon a state change, a ZK multi-operation is used to (a) delete the previous znode for the replica, and (b) add a new znode with the updated state. This multi-operation is performed directly by the individual nodes (that host the replica whose state is changing), instead of going via the overseer and its queue. With this approach, on a large Solr cluster (lots of nodes, lots of collections), it is easy to see the benefits of this solution.\n Minimize data writes/reads: With the per-replica state approach, the data written to ZK is dramatically reduced. For a simple state update, the data written to ZK is just 10 bytes, instead of 100+ KB in the case of the single-state model, where every update rewrites the state of a large collection. The data read by nodes is also minimal and the deserialization costs are negligible (no JSON parsing needed). Reduce overhead of overseer: State updates are performed as a direct znode update from the respective nodes. Increased concurrency while writing to states: With PRS, we can modify the states of hundreds of replicas in a collection in parallel without any contention, as each replica state is a separate znode. This means a rolling restart of a cluster can be safely done with more nodes restarted at once than with the previous approach. The PRS approach reduces the memory pressure on Solr (on the overseer, as well as regular nodes), ultimately enhancing Apache Solr’s scalability potential. Design The state information for each replica is encoded as a child znode of the state.json znode for the collection. The overall structure of the collection (names and locations of shards, replicas etc.) is still reflected in state.json. This encoding follows the syntax: $N:$V:$S or $N:$V:$S:L, where $N is the core node name of the replica (as specified in state.json), $V is the version of the update (it increases every time this replica\u0026rsquo;s state is updated), $S is the state (A for active, R for recovering, D for down). If the replica is a leader, a \u0026quot;:L\u0026quot; is appended. When a replica changes state (e.g. as a result of a node restarting, or intermittent failures), state updates directly modify these child znodes. 
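To make the naming scheme concrete, here is an illustrative listing of what the children of a collection\u0026rsquo;s state.json might look like for one shard with three replicas (the core node names and version numbers below are made up for illustration):
core_node3:4:A:L (version 4, active, leader)
core_node5:2:A (version 2, active)
core_node7:6:D (version 6, down)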
How to use this? This feature is enabled on a per-collection basis with a special flag (perReplicaState=true/false). When a collection is created, this parameter can be passed along to enable this feature.\nhttp://localhost:8983/solr/admin/collections?action=CREATE\u0026amp;name=collection-name\u0026amp;numShards=1\u0026amp;perReplicaState=true This is a modifiable attribute, so an existing collection can be migrated to the new format using a MODIFYCOLLECTION command:\nhttp://localhost:8983/solr/admin/collections?action=MODIFYCOLLECTION\u0026amp;collection=collection-name\u0026amp;perReplicaState=true Similarly, it can be switched back to the old format by flipping the flag:\nhttp://localhost:8983/solr/admin/collections?action=MODIFYCOLLECTION\u0026amp;collection=collection-name\u0026amp;perReplicaState=false
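If you want to see the per-replica state znodes for yourself after enabling the flag, one way (a sketch, not the only way, and assuming Solr\u0026rsquo;s ZooKeeper CLI is available; the ZooKeeper address below is only an example) is to list the children of the collection\u0026rsquo;s state.json znode:\nbin/solr zk ls /collections/collection-name/state.json -z localhost:9983\nWith perReplicaState=true, this should print child entries named in the $N:$V:$S format described above (the exact names will vary per cluster).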
Conclusion In a subsequent post, we shall present benchmarks of this new solution compared to the baselines. Some of those have been discussed in https://issues.apache.org/jira/browse/SOLR-15052. As with all new features, please give this a try in a non-production environment first, and report bugs (if any) to Apache Solr JIRA.\n","permalink":"https://searchscale.com/blog/prs/","tags":["SolrCloud"],"title":"Per Replica States: Improving SolrCloud stability \u0026 reliability"},{"categories":["Tools"],"contents":" What is Solr Bench? Solr Bench is a flexible, configurable benchmarking suite for Solr. Given a dataset, a set of queries, a configset and a build of Solr, Solr Bench will run benchmarks against a deployed setup and output key performance metrics for indexing, querying, CPU usage, memory usage etc. This project was started as a Google Summer of Code project by Vivek Narang and later supported by FullStory.\nWho is this for? Solr developers / contributors / committers can use this alongside CI systems to measure performance at every commit point. Organizations using Solr can run it before upgrading Solr versions to validate performance characteristics, for performance tuning using various configuration parameters, for hardware sizing and capacity estimation for planned workloads, and for choosing the right sharding and replication strategy for their traffic patterns and workloads. How does it work? At the moment, there are two modes of provisioning Solr nodes for benchmarking:\n Local provisioning: Solr nodes (JVMs) will all be spun up on the same machine from where the benchmarking will be executed. Google Cloud Platform VM Instances: Solr Bench will spin up VM instances (as configured) and run Solr on them. This provisioning works via Terraform internally. The Solr build for the benchmarking can either be provided (as a tgz file) or Solr Bench can build it from a Git repository. The dataset for indexing can be either a TSV or a JSONL file.\nCheck out https://github.com/searchscale/solr-bench for more details.\n","permalink":"https://searchscale.com/blog/solr-bench/","tags":["Performance","Intermediate"],"title":"Solr Bench: Performance Benchmarking Suite"},{"categories":["Tutorials"],"contents":" What is Solr? In simple words, it is a search engine. Similar to a relational database system (like MySQL etc.), it can store textual, numeric, spatial or binary data and allows quick search and retrieval. Here’s how the equivalent concepts map between a database system and Solr:\n a Table corresponds to a Collection, a Row to a Document, and a Column to a Field. Solr is suitable for searching, filtering and faceting across full-text fields and other types of fields, and it is flexible in influencing the ranked order of retrieved results based on relevance to the queries. A major difference between Solr and database systems is that Solr is not suitable for join operations across multiple collections, unlike database systems where joins across tables are very common. Database practitioners suggest normalizing data, but in Solr it is recommended to keep your data as de-normalized as possible. Solr offers tons of other features like searching across multiple fields at once, spell correction, highlighting, grouping, streaming functions, robust scaling features etc. We shall explore all of those in subsequent posts.\nRunning Solr For this tutorial, let's use Docker to start up Solr 8.5.2, create a collection, index a few documents and perform some search queries. This article assumes no prior knowledge of Docker; it just assumes that Docker is already installed. 
To install Docker, visit https://get.docker.com.\ndocker run -it -p 8983:8983 -p 9983:9983 solr:8.5.2 /opt/solr/bin/solr -c -f Output:\n2020-06-20 14:19:20.985 INFO (main) [ ] o.e.j.u.log Logging initialized @965ms to org.eclipse.jetty.util.log.Slf4jLog 2020-06-20 14:19:21.152 INFO (main) [ ] o.e.j.s.Server jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.7+10 2020-06-20 14:19:21.213 INFO (main) [ ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:///opt/solr-8.5.2/server/contexts/] at interval 0 2020-06-20 14:19:21.456 INFO (main) [ ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet 2020-06-20 14:19:21.466 INFO (main) [ ] o.e.j.s.session DefaultSessionIdManager workerName=node0 2020-06-20 14:19:21.466 INFO (main) [ ] o.e.j.s.session No SessionScavenger set, using defaults 2020-06-20 14:19:21.469 INFO (main) [ ] o.e.j.s.session node0 Scavenging every 600000ms 2020-06-20 14:19:21.537 INFO (main) [ ] o.a.s.s.SolrDispatchFilter Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory 2020-06-20 14:19:21.541 INFO (main) [ ] o.a.s.s.SolrDispatchFilter ___ _ Welcome to Apache Solr™ version 8.5.2 2020-06-20 14:19:21.542 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _ Starting in cloud mode on port 8983 2020-06-20 14:19:21.542 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \\__ \\/ _ \\ | '_| Install dir: /opt/solr 2020-06-20 14:19:21.542 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\\___/_|_| Start time: 2020-06-20T14:19:21.542368Z 2020-06-20 14:19:21.580 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /var/solr/data 2020-06-20 14:19:21.587 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /var/solr/data/solr.xml ... 2020-06-20 14:19:22.547 INFO (main) [ ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983 2020-06-20 14:19:22.547 WARN (main) [ ] o.a.s.c.SolrZkServer Embedded Zookeeper is not recommended in production environments. See Reference Guide for details. 2020-06-20 14:19:23.049 INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983 2020-06-20 14:19:23.094 INFO (main) [ ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper 2020-06-20 14:19:23.117 INFO (zkConnectionManagerCallback-7-thread-1) [ ] o.a.s.c.c.ConnectionManager zkClient has connected 2020-06-20 14:19:23.117 INFO (main) [ ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper ... 2020-06-20 14:19:23.940 INFO (main) [ ] o.e.j.s.AbstractConnector Started ServerConnector@56ace400{HTTP/1.1,[http/1.1, h2c]}{0.0.0.0:8983} 2020-06-20 14:19:23.941 INFO (main) [ ] o.e.j.s.Server Started @3924ms At this point, Solr is started up and is ready to go (point your browser to http://localhost:8983/solr to view the Solr’s Admin UI). Before we proceed, though, let us understand the various parts of the Docker command used to start Solr. The base command is “run” which is used to instantiate a Docker container within which Solr will be running.\n The flags -it instruct Docker to run the container interactively (as opposed to running it in the background) and allocating a pseudo TTY to go with it.\n The flag -p is used to expose a port from within a container and map it to a port opened in the host computer (where Docker is running). Since 8983 is the default Solr port, it is exposed through this mechanism so that we could interact with Solr now. 
The port 9983 in this example refers to a ZooKeeper port to which other Solr containers can potentially connect later to form a Solr cluster of multiple Solr nodes.\n solr:8.5.2 refers to the application and version that needs to be started. In this case, the Solr application’s official Docker image will be pulled from the central Docker registry (Docker Hub) and the image will be used to start Solr containers.\n Here, /opt/solr/bin/solr -c -f is the command that starts Solr after the container is started. Inside the container, Solr is installed in the /opt/solr directory and the ./bin/solr script is used to start Solr. The -c parameter to the bin/solr script instructs it to start Solr in \u0026ldquo;cloud\u0026rdquo; mode or \u0026ldquo;SolrCloud\u0026rdquo; mode. It means that Solr starts up as part of a cluster in distributed mode. In cloud mode, an (embedded) instance of ZooKeeper, used for cluster coordination, is started up alongside the Solr process; other Solr nodes can be made part of this SolrCloud cluster by connecting to this ZooKeeper instance. The -f parameter instructs the bin/solr script to start Solr in the foreground so that the Docker container continues to run and the logs are displayed.\n Interacting with Solr: A books collection From a separate terminal, issue the following commands:\nCreate a collection: curl -X POST \\ http://localhost:8983/api/collections \\ -d '{ \u0026quot;create\u0026quot;: { \u0026quot;name\u0026quot;: \u0026quot;books\u0026quot;, \u0026quot;numShards\u0026quot;: 1 } }' Indexing documents into the collection: Indexing one document at a time:\ncurl -X POST \\ -d '{\u0026quot;id\u0026quot;:\u0026quot;1\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Hitchhikers Guide to the Galaxy\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Douglas Adams\u0026quot;}' \\ http://localhost:8983/api/collections/books/update?commit=true curl -X POST \\ -d '{\u0026quot;id\u0026quot;:\u0026quot;2\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;My Family and Other Animals\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Gerald Durrell\u0026quot;}' \\ http://localhost:8983/api/collections/books/update?commit=true curl -X POST \\ -d '{\u0026quot;id\u0026quot;:\u0026quot;3\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;1984\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;George Orwell\u0026quot;}' \\ http://localhost:8983/api/collections/books/update?commit=true curl -X POST \\ -d '{\u0026quot;id\u0026quot;:\u0026quot;4\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Lucene in Action\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Erik Hatcher\u0026quot;}' \\ http://localhost:8983/api/collections/books/update?commit=true Or, batch indexing:\ncurl -X POST -d \\ '[ {\u0026quot;id\u0026quot;:\u0026quot;1\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Hitchhikers Guide to the Galaxy\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Douglas Adams\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;2\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;My Family and Other Animals\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Gerald Durrell\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;3\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;1984\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;George Orwell\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;4\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Lucene in Action\u0026quot;, \u0026quot;author\u0026quot;:\u0026quot;Erik Hatcher\u0026quot;} ]' \\ \u0026quot;http://localhost:8983/api/collections/books/update?commit=true\u0026quot; Search queries: Get all Solr documents (books):\ncurl http://localhost:8983/api/c/books/query -d '{\u0026quot;query\u0026quot;: \u0026quot;*:*\u0026quot;}' Search for books titled “lucene”:\ncurl http://localhost:8983/api/collections/books/query -d '{\u0026quot;query\u0026quot;: \u0026quot;title:lucene\u0026quot;}' Search for books by “orwell”:\ncurl http://localhost:8983/api/collections/books/query -d '{\u0026quot;query\u0026quot;: \u0026quot;author:orwell\u0026quot;}' Search for books by “douglas adams”:\ncurl http://localhost:8983/api/collections/books/query -d '{\u0026quot;query\u0026quot;: \u0026quot;author:\\\u0026quot;douglas adams\\\u0026quot;\u0026quot;}'
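The JSON Request API used above accepts more than just a query string. As a hedged illustration (the parameter values are arbitrary examples, not part of the original tutorial), a single request body can combine a query with a filter, a field list and a row limit:\ncurl http://localhost:8983/api/collections/books/query -d '{ \u0026quot;query\u0026quot;: \u0026quot;*:*\u0026quot;, \u0026quot;filter\u0026quot;: \u0026quot;author:orwell\u0026quot;, \u0026quot;fields\u0026quot;: \u0026quot;id,title\u0026quot;, \u0026quot;limit\u0026quot;: 2 }'\nHere, filter behaves like the classic fq parameter (it restricts the result set without affecting relevance scores), fields selects the stored fields to return, and limit caps the number of returned rows.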
Conclusion This was just a quick five-minute introduction. There are various nuances associated with collection creation (like sharding and replication), indexing (schema management etc.) and querying (different query parsers etc.). Refer to the official reference guide for more details.\n","permalink":"https://searchscale.com/blog/starting/","tags":["Beginner","Tutorial"],"title":"Getting Started with Apache Solr"},{"categories":["Usecases"],"contents":" New Support for Payloads in Solr Support for payloads has existed in Lucene for a long time, but good out-of-the-box support for payloads in Solr was finally introduced in version 6.6. Payloads can be understood as a per-document map of terms to values (string or numeric). Payloads can be used to store metadata for documents. They lend themselves very well to implementing a per-document weighted attribute set that can be used for filtering/scoring of those documents.\nIn this example, let us create a collection using a tiny portion of the Open Product Data. 
For this e-commerce oriented usecase of payloads, we’ll add clickstream based payloads to help in achieving better relevance for user queries.\nStart SolrCloud First of all, lets start Solr 8.5 (or any version \u0026gt;= 7.0) in cloud mode.\nDocker:\ndocker run -it -p 8983:8983 solr:8.5.2 /opt/solr/bin/solr -c -f Without Docker:\nwget http://www-us.apache.org/dist/lucene/solr/8.5.2/solr-8.5.2.tgz tar -xvf http://www-us.apache.org/dist/lucene/solr/8.5.2/solr-8.5.2.tgz cd solr-8.5.2 bin/solr -c Note: If you prefer using Postman, here is the postman collection that you can import.\nCreate a collection called “products” curl -X POST \u0026quot;http://localhost:8983/api/collections\u0026quot; -d \\ '{\u0026quot;create\u0026quot;: {\u0026quot;name\u0026quot;: \u0026quot;products\u0026quot;, \u0026quot;numShards\u0026quot;: 1}}' Index a few documents into the collection curl -X POST \\ 'http://localhost:8983/api/collections/products/update?commit=true' \\ -H 'content-type: application/json' \\ -d '[ {\u0026quot;id\u0026quot;:\u0026quot;5000147030156\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;bargains4you | Bargains4You Iphone 4, Iphone 4S, Iphone s, Iphone 5, Iphone Signal \u0026amp;amp; Wifi Boosters X 2\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0602956\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Marware Membrane iPhone Case - iPhone - Smoke - Polypropylene\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0602956009542\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Marware Membrane iPhone Case - iPhone - Smoke - Polypropylene\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;087956904\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Fosmon iPhone Case - iPhone - Purple - Silicone, Thermoplastic Polyurethane (TPU)\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046442\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Belkin BodyGuard Hue iPhone Case - iPhone - Yellow, Light Graphite\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046459\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Belkin Bodyguard Hue iPhone Case - iPhone - Garnet, Light Graphite\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046541\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Fosmon iPhone Case - iPhone - Blue - Silicone, Thermoplastic Polyurethane (TPU)\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046565\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Fosmon iPhone Case - iPhone - Orange - Silicone, Thermoplastic Polyurethane (TPU)\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046572\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Fosmon iPhone Case - iPhone - Green - Silicone, Thermoplastic Polyurethane (TPU)\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0879569046589\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Fosmon iPhone Case - iPhone - Purple - Silicone, Thermoplastic Polyurethane (TPU)\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0885909229352\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Iphone Mb048ll A Smartphone\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;4713507011535\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Girrafe iPhone 4 / 4S Case - iPhone Designer Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;4713507\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Tiger iPhone 4 / 4S Covers - iPhone 4S Custom Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;8801105\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Spot iPhone 4 
/ 4S Covers - Design An iPhone Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;4713507019135\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Tiger iPhone 4 / 4S Covers - iPhone 4S Custom Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;8801105000535\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Spot iPhone 4 / 4S Covers - Design An iPhone Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0758302\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Ipad Ipod \u0026amp;amp; Iphone Wall Charger\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0660543008323\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Iphone 4 Impact Case Blue\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0758302638604\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Ipad Ipod \u0026amp;amp; Iphone Wall Charger\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0885909459865\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Iphone 4s 32gb Black White\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0885909503865\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Iphone 4 32gb In Black\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;8801105000177\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;iMarkCase | Girl iPhone 4 / 4S Covers - Design Your Own iPhone 4S Phone Case\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0098689\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Lucky Hard Case Feather Iphone 4\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0047532893908\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Alarm Clock For Iphone Ipod Black\u0026quot;}, {\u0026quot;id\u0026quot;:\u0026quot;0098689392233\u0026quot;, \u0026quot;title\u0026quot;:\u0026quot;Lucky Hard Case Feather Iphone 4\u0026quot;}]' Query for iPhone curl \u0026quot;http://localhost:8983/solr/products/select?q=title:iphone\u0026quot; { \u0026quot;response\u0026quot;:{\u0026quot;numFound\u0026quot;:259,\u0026quot;start\u0026quot;:0,\u0026quot;docs\u0026quot;:[ { \u0026quot;id\u0026quot;:\u0026quot;5000147030156\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;bargains4you | Bargains4You Iphone 4, Iphone ... Boosters X 2\u0026quot;]}, { \u0026quot;id\u0026quot;:\u0026quot;0602956\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;Marware Membrane iPhone Case - iPhone - Smoke - Polypropylene\u0026quot;]}, { \u0026quot;id\u0026quot;:\u0026quot;0602956009542\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;Marware Membrane iPhone Case - iPhone - Smoke - Polypropylene\u0026quot;]}, { \u0026quot;id\u0026quot;:\u0026quot;087956904\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;Fosmon iPhone Case - iPhone - Purple - ... Polyurethane (TPU)\u0026quot;]}, { ... }} As you can see, the results here are all over the place in terms of relevance to user expectations. Usually, the intent of such queries are to explore the actual phone referred to as “iphone”, instead of cases and accessories for iPhone.\nAdding payloads Assume that we have gathered clickstream statistics through offline processing and we know the click count for a pair. Using payloads, we can associate the query and number of clicks (for that query) for each of the products. 
Here’s an example (for one of the most popular products):\ncurl -X POST 'http://localhost:8983/solr/products/update?commit=true' \\ -d '[{\u0026quot;id\u0026quot;:\u0026quot;0885909459865\u0026quot;, \u0026quot;queries_dpf\u0026quot;: {\u0026quot;set\u0026quot;: \u0026quot;iphone|20 apple|15\u0026quot;}}]' Here, queries_dpf is a dynamic float payload field. The above represents the scenario that the product “0885909459865” was clicked 20 times for the query “iphone” and 15 times for the query \u0026ldquo;apple\u0026rdquo;.\nSorting the results by the payload To let the click counts participate in the ranking of the results, we can sort the results by a payload function: payload(queries_dpf,iphone,0). The first parameter is the payload field name, the second is the term that has the payload and the last is the default payload value (in case there is no payload for that term).\ncurl \\ 'http://localhost:8983/solr/products/select?userquery=iphone\u0026amp;q=%7B!query%20v%3D%24userquery%7D\u0026amp;df=title\u0026amp;fl=id,title,payload(queries_dpf,%24userquery,0)\u0026amp;sort=payload(queries_dpf,%24userquery,0)%20DESC,score%20DESC' { \u0026quot;response\u0026quot;:{\u0026quot;numFound\u0026quot;:259,\u0026quot;start\u0026quot;:0,\u0026quot;docs\u0026quot;:[ { \u0026quot;id\u0026quot;:\u0026quot;0885909459865\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;Iphone 4s 32gb Black White\u0026quot;], \u0026quot;payload(queries_dpf,iphone,0)\u0026quot;:20.0}, { \u0026quot;id\u0026quot;:\u0026quot;5000147030156\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;bargains4you | Bargains4You ... Wifi Boosters X 2\u0026quot;], \u0026quot;payload(queries_dpf,iphone,0)\u0026quot;:0.0}, { \u0026quot;id\u0026quot;:\u0026quot;0602956\u0026quot;, \u0026quot;title\u0026quot;:[\u0026quot;Marware Membrane iPhone Case ... Polypropylene\u0026quot;], \u0026quot;payload(queries_dpf,iphone,0)\u0026quot;:0.0}, { ... ...}] } As you can see, the most clicked document for the query “iphone” was shown first, leading to a much more relevant user experience.\nInstead of sorting, you could also use the payload function in scoring as well. Also, you can filter by payloads. Check out the reference guide for the payload query parsers or the payload function.\n","permalink":"https://searchscale.com/blog/payloads/","tags":["Payloads","Advanced"],"title":"Payloads: Boost Popular Products for Queries"},{"categories":null,"contents":"","permalink":"https://searchscale.com/team/1-ishan/","tags":null,"title":" Ishan Chattopadhyaya"},{"categories":null,"contents":"","permalink":"https://searchscale.com/team/2-noble/","tags":null,"title":" Noble Paul"},{"categories":null,"contents":" Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit.\nQuia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\nBenifits of service Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. 
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n Quality Services Clients Satisfaction Quality Services Clients Satisfaction Quality Services Clients Satisfaction Business Strategy Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia dese runt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem acusantium.\n Quality Services Clients Satisfaction Quality Services Analyze your business Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n","permalink":"https://searchscale.com/service/infrastructure-auditing/","tags":null,"title":"Infrastructure Auditing"},{"categories":null,"contents":"","permalink":"https://searchscale.com/team/3-kishore/","tags":null,"title":"Kishore Angani"},{"categories":null,"contents":" Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit.\nQuia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\nBenifits of service Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n Quality Services Clients Satisfaction Quality Services Clients Satisfaction Quality Services Clients Satisfaction Business Strategy Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia dese runt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem acusantium.\n Quality Services Clients Satisfaction Quality Services Analyze your business Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n","permalink":"https://searchscale.com/service/performance-optimization/","tags":null,"title":"Performance Optimization"},{"categories":null,"contents":" Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit.\nQuia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. 
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\nBenifits of service Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n Quality Services Clients Satisfaction Quality Services Clients Satisfaction Quality Services Clients Satisfaction Business Strategy Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia dese runt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem acusantium.\n Quality Services Clients Satisfaction Quality Services Analyze your business Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n","permalink":"https://searchscale.com/service/solr-consulting/","tags":null,"title":"Solr Consulting"},{"categories":null,"contents":" Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit.\nQuia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\nBenifits of service Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n Quality Services Clients Satisfaction Quality Services Clients Satisfaction Quality Services Clients Satisfaction Business Strategy Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia dese runt mollit anim id est laborum. sed ut perspiciatis unde omnis iste natus error sit voluptatem acusantium.\n Quality Services Clients Satisfaction Quality Services Analyze your business Quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam.\n","permalink":"https://searchscale.com/service/training-knowledge-support/","tags":null,"title":"Training and Knowledge Support"}]