Original article: A Brief Discussion of Multi-Threading
This week: program within a framework that uses multiple threads, become familiar with benchmarking tools for performance testing, and understand the principles behind a search engine.
This week we're working with threads. One step of creating a search engine is to index all the items we might want to search. This gets around the problem of having to search through everything each time we query.
Here's an example: I want to get a list of movies in a particular genre, but if I don't have an index, the only thing I can do is load up a list of movies and then search through them (possibly sorting by genre first, then binary searching). But if I've indexed the list of movies by genre, then I already have that list, and searching through it (say, all the comedies) becomes much more efficient.
At this point, you've implemented the indexing system. We want to analyze our system to determine how many resources it's using. There are a few things to measure:
We basically have two ways to index: either the TypeIndex (where we use SetOfMovies, indexing the movies by a field and storing them in a LinkedList of Movie structs) or the OffsetIndex (where we use MovieSet and store just the DocIds and RowIds recording where the movies live on disk).
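To make the trade-off concrete, here is a minimal sketch in C of what each index stores per entry. The field names and types are assumptions based on the description above, not the actual definitions; those live in the provided interfaces.

/* Sketch only: field and type names are assumptions, not the provided API. */

/* TypeIndex / SetOfMovies: keyed by a field value (e.g. a genre), each entry
 * keeps a linked list of full Movie structs in memory. */
typedef struct movie_node {
  struct Movie *movie;          /* the full record: title, genre, year, ... */
  struct movie_node *next;
} MovieNode;

typedef struct {
  char *field_value;            /* e.g. "Comedy" */
  MovieNode *movies;            /* LinkedList of Movie structs */
} TypeIndexEntry;

/* OffsetIndex / MovieSet: keyed by a word, each entry records only where the
 * matching rows live on disk, so the Movie structs stay out of memory. */
typedef struct {
  char *word;                   /* a word from a movie title */
  int  *doc_ids;                /* which file each matching row is in */
  int  *row_ids;                /* which row within that file */
  int   num_entries;
} OffsetIndexEntry;

The hypothesis to test is that the TypeIndex answers field queries quickly but keeps every Movie resident in memory, while the OffsetIndex stays small but pays an extra disk read for each result.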
Start by clarifying your hypotheses:
There are clearly queries that will be answered faster with one index than the other:
To make the query comparison fair, let's define some sample queries:
Use the benchmarker to do some of these tests. Feel free to modify the code to measure different things to help explore your hypotheses.
An example invocation:
adrienne@virtualbox$ ./benchmarker data_small/
[snip]
Created DocIdMap
Cur Real Mem: 700   Peak Real Mem: 700   Cur VirtMem: 4376   PeakVirtMem: 4376

Building the OffsetIndex
Parsing and indexing files...
22197 entries in the index.
Took XXXX seconds to execute.
Memory usage:
Cur Real Mem: XXXX   Peak Real Mem: XXXX   Cur VirtMem: XXXX   PeakVirtMem: XXXX
Destroyed OffsetIndex
Cur Real Mem: XXXX   Peak Real Mem: XXXX   Cur VirtMem: XXXX   PeakVirtMem: XXXX

Building the TypeIndex
10 entries in the index.
Took XXXX seconds to execute.
Memory usage:
Cur Real Mem: XXXX   Peak Real Mem: XXXX   Cur VirtMem: XXXX   PeakVirtMem: XXXX
Destroyed TypeIndex
Cur Real Mem: XXXX   Peak Real Mem: XXXX   Cur VirtMem: XXXX   PeakVirtMem: XXXX

Destroyed DocIdMap
Cur Real Mem: XXXX   Peak Real Mem: XXXX   Cur VirtMem: XXXX   PeakVirtMem: XXXX
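If you want to add measurements of your own, one approach is sketched below, assuming a Linux guest: read the Vm* lines from /proc/self/status and wrap the region of interest with clock_gettime. The provided benchmarker may gather these numbers differently, and BuildOffsetIndex in the usage comment is a hypothetical call.

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Print current/peak resident and virtual memory as reported by the kernel.
 * VmRSS ~ "Cur Real Mem", VmHWM ~ "Peak Real Mem",
 * VmSize ~ "Cur VirtMem", VmPeak ~ "PeakVirtMem" (all in kB). */
static void PrintMemUsage(void) {
  FILE *f = fopen("/proc/self/status", "r");
  if (f == NULL) return;
  char line[256];
  while (fgets(line, sizeof(line), f) != NULL) {
    if (strncmp(line, "VmRSS", 5) == 0 || strncmp(line, "VmHWM", 5) == 0 ||
        strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmPeak", 6) == 0) {
      fputs(line, stdout);
    }
  }
  fclose(f);
}

/* Wall-clock time in seconds, for timing a region of code. */
static double NowSeconds(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Usage sketch:
 *   double start = NowSeconds();
 *   BuildOffsetIndex(...);
 *   printf("Took %f seconds to execute.\n", NowSeconds() - start);
 *   PrintMemUsage();
 */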
When you're done with this step:
Now that we have an indexer, we're going to multi-thread the FileParser.
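A minimal sketch of the shape this change usually takes is below, in C with pthreads. The helper ParseFileIntoIndex, the generic index pointer, and NUM_THREADS are assumptions for illustration, not the starter code's actual API; the two key ideas are dividing the files among the threads and protecting the shared index with a mutex.

#include <pthread.h>

#define NUM_THREADS 4   /* assumption: experiment with this number */

/* Hypothetical helper assumed to exist elsewhere: parses one file and inserts
 * its rows into the shared index, taking the lock around each insertion. */
void ParseFileIntoIndex(const char *filename, void *index, pthread_mutex_t *lock);

/* Everything one worker thread needs: its slice of the files, plus the
 * shared index and the mutex that guards it. */
typedef struct {
  char **filenames;        /* the full list of files to parse */
  int start;               /* first file this thread handles (inclusive) */
  int end;                 /* one past the last file this thread handles */
  void *index;             /* the shared index being built */
  pthread_mutex_t *lock;   /* guards insertions into the shared index */
} ParseArgs;

static void *ParseFiles(void *arg) {
  ParseArgs *args = (ParseArgs *)arg;
  for (int i = args->start; i < args->end; i++) {
    ParseFileIntoIndex(args->filenames[i], args->index, args->lock);
  }
  return NULL;
}

void ParseAllFiles(char **filenames, int num_files, void *index) {
  pthread_t threads[NUM_THREADS];
  ParseArgs args[NUM_THREADS];
  pthread_mutex_t lock;
  pthread_mutex_init(&lock, NULL);

  /* Give each thread a contiguous slice of the file list. */
  int per_thread = (num_files + NUM_THREADS - 1) / NUM_THREADS;
  for (int t = 0; t < NUM_THREADS; t++) {
    args[t].filenames = filenames;
    args[t].start = t * per_thread;
    args[t].end = (t + 1) * per_thread;
    if (args[t].start > num_files) args[t].start = num_files;
    if (args[t].end > num_files) args[t].end = num_files;
    args[t].index = index;
    args[t].lock = &lock;
    pthread_create(&threads[t], NULL, ParseFiles, &args[t]);
  }
  /* Wait for every worker to finish before using the index. */
  for (int t = 0; t < NUM_THREADS; t++) {
    pthread_join(threads[t], NULL);
  }
  pthread_mutex_destroy(&lock);
}

Note that each thread gets its own ParseArgs slot rather than sharing one struct; reusing a single argument struct across pthread_create calls is a classic data race.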
Changes to make:
When you're done with this step:
Run your code on the provided "data" directory. The data directory was distributed as part of a7; either move that directory to the a9 directory, or point your program at a7/data.
Now that your program works, let's do some experimentation. We spoke in class about how a computer with a single processor won't benefit much from using multiple threads. We can measure the difference, though.
Further, since we're running a virtual machine, we can tell it how many cores to use.
If your machine has multiple cores, it will show on this screen. Crank up the cores, see how performance improves! (that is, run benchmarker again).
Submit your assignment by pushing your code to Github. Tag it with the tag a9-final.
Tag your last commit, and submit the tag on Blackboard by the due date, 9pm (Pacific Time).
Things to turn in:
If you mistakenly tag your repo as a9-final and find yourself having to create a new final tag, delete the old a9-final tag first. We don't want to have to figure out which of a9-final, a9-final-a, a9-final2, ... is the one we should be grading.
In the bigger scheme of things, we're putting pieces together to build a search engine. A search engine has a few components: a file crawler, a file processor, an indexer, and a query processor. The file crawler starts in one place and traverses to find other files. On the web, this means starting at one page and following all the links on that page; in a file system, it means starting in one directory and traversing through the directory structure until all files have been found. The file processor takes each of those files and finds significant words or concepts in the file. The indexer populates a data structure that makes it easy to find documents that have similar topics, or that contain a certain word. The query processor lets us work with that data structure to get results based on what we ask for.
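For the file-system case, the crawler is just a recursive directory walk. Here is a sketch in C using opendir/readdir; ProcessFile is a hypothetical stand-in for whatever the file processor does with each file, and the starter code's crawler may be organized differently.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the file processor; assumed to exist elsewhere. */
void ProcessFile(const char *path);

/* Recursively visit every regular file under dir_path and hand it off. */
void CrawlDirectory(const char *dir_path) {
  DIR *dir = opendir(dir_path);
  if (dir == NULL) return;

  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    /* Skip the "." and ".." entries, or we'd recurse forever. */
    if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
      continue;

    char path[4096];
    snprintf(path, sizeof(path), "%s/%s", dir_path, entry->d_name);

    struct stat st;
    if (stat(path, &st) != 0) continue;
    if (S_ISDIR(st.st_mode)) {
      CrawlDirectory(path);    /* descend into the subdirectory */
    } else if (S_ISREG(st.st_mode)) {
      ProcessFile(path);       /* a file: hand it to the file processor */
    }
  }
  closedir(dir);
}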
In this homework, I'm providing you with a working movie indexer. We have a bunch of movie data, and you're going to put the pieces together to index the titles of these movies. Once the movies have been indexed, when we query the index with a word, we will list all of the movie titles that contain that word.
The trick is that we have about 5 million movies, videos and TV shows that we want to index. We can't just load them all into memory and search through them! We have to index them.
What does it mean to index them? Well, we start with a bunch of files (about 1,000), and each file contains about 10,000 records. We open each file, get the words in the title of the movie in each row, and save which file and which row that record is in. Then, when we want to find a movie record whose title contains a given word, we look up that word in our index, which gives us the filename and row number of a movie with that word; we open that file, go to that row, and print it out.
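Retrieving a hit is then a single pass through the right file to the right row. Here is a sketch, assuming rows are newline-terminated and row numbers are zero-based; the starter code may count or read rows differently.

#include <stdio.h>

/* Print row row_id (zero-based) of the named file.
 * Assumes each row fits in the line buffer.
 * Returns 0 on success, -1 if the file or row could not be read. */
int PrintRow(const char *filename, int row_id) {
  FILE *f = fopen(filename, "r");
  if (f == NULL) return -1;

  char line[4096];
  int row = 0;
  while (fgets(line, sizeof(line), f) != NULL) {
    if (row == row_id) {
      fputs(line, stdout);   /* this is the movie record we indexed */
      fclose(f);
      return 0;
    }
    row++;
  }
  fclose(f);
  return -1;                 /* row_id was past the end of the file */
}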
So what are you building in this assignment? First, you're going to write the part that does the file processing. This should be straightforward; you've done this a bunch already. But we have so much data to deal with that we're going to throw a few threads at it at a time. Because your VirtualBox is set up to have a single core, you won't see much improvement at first. But your machine likely has multiple cores available to it. In the very last step, you'll run your code with multiple cores, and you will hopefully see an improvement. Finally, if you haven't already, you will be required to use the INTERFACES provided for this assignment. In fact, if you use the provided interfaces, the assignment should be very straightforward; it becomes much more complicated if you try to break the abstractions.
This week, you'll primarily be working with the FileParser (in orange), as well as the MainSystem.
We have a unique DocSet for every unique word in every movie title. The DocSet maintains a list of DocIds (which documents contain a movie with this word) and, for each document, a list of rows (which row the specific movie is in).
When we ask the index for a set of SearchResults for a given word, a list of docId-rowId pairs is returned. Then we use that information to go look up the specific row in the file and return that row to the customer.
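Conceptually, the structures look something like the sketch below; the field names are assumptions, and the real definitions live in the provided interfaces. The DocIdMap seen in the benchmarker output is (presumably) what maps a docId back to a filename so the row can actually be fetched from disk, as in the PrintRow sketch earlier.

/* Sketch of the shapes described above; field names are assumptions. */

/* One word's postings: which documents contain the word, and where. */
typedef struct doc_rows {
  int doc_id;               /* which file */
  int *row_ids;             /* which rows within that file */
  int num_rows;
  struct doc_rows *next;
} DocRows;

typedef struct {
  char *word;               /* the indexed title word */
  DocRows *docs;            /* (document, rows) entries for this word */
} DocSet;

/* One hit handed back by a query: a single docId-rowId pair. */
typedef struct {
  int doc_id;
  int row_id;
} SearchResult;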
This is the overall flow:
(This article originally appeared on csprojectedu.com. Please credit the source when reposting.)