References

To make references as accessible as we can, we have included URLs where possible. Unfortunately, over time, Web sites disappear, accounts are reorganized, and hyperlinks break. If you find a broken link, you may be able to find the paper by pasting the title and author names into a general search engine such as www.google.com or an academic paper collection such as citeseer.com, or by pasting the broken URL into the Wayback Machine at www.archive.org.

[1] R. Agrawal, R. J. Bayardo, and R. Srikant. Athena: Mining-based interactive management of text databases. In 7th International Conference on Extending Database Technology (EDBT), Konstanz, Germany, March 2000. www.almaden.ibm.com/cs/people/ragrawal/papers/athena.ps.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference on Management of Data, Seattle, WA, June 1998. www.almaden.ibm.com/cs/quest/papers/sigmod98_clique.pdf.
[3] J. Allan. Automatic hypertext link typing. In 7th ACM Conference on Hypertext, Hypertext ’96, pages 42–51, 1996.
[4] J. Allen. Natural Language Understanding. Benjamin Cummings, 1987, 1995.
[5] B. Amento, L. G. Terveen, and W. C. Hill. Does “authority” mean quality? Predicting expert quality ratings of Web documents. In SIGIR, pages 296–303. ACM, 2000. citeseer.nj.nec.com/417258.html.
[6] E. L. Antworth. PC-KIMMO: A two-level processor for morphological analysis. Summer Institute of Linguistics, International Academic Bookstore, Dallas, 1990. www.sil.org/pckimmo/pc-kimmo.html.
[7] C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 1994. Also published as IBM Research Report RC18879.
[8] D. J. Arnold, L. Balkan, R. L. Humphreys, S. Meijer, and L. Sadler. Machine translation: An introductory guide, 1995. clwww.essex.ac.uk/~doug/book/book.html, clwww.essex.ac.uk/MTbook/, www.essex.ac.uk/linguistics/clmt/MTbook/.
[9] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis for data mining. In STOC, vol. 33, pages 619–626, 2001.
[10] Babelfish Language Translation Service, www.altavista.com, 1998.
[11] F. Bacon. The Advancement of Learning. Clarendon Press, 1873.
[12] Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 535–544, 2000. www.cs.berkeley.edu/~zivi/papers/webwalker/webwalker.ps.gz.
[13] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286, pages 509–512, 1999.
[14] A. Berg. Random jumps in Web Walker. Personal communication, April 2001.
[15] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595, 1995. www.cs.utk.edu/~library/TechReports/1994/ut-cs-94-270.ps.z.
[16] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public Web search engines. In 7th World Wide Web Conference (WWW7), 1998. www7.scu.edu.au/programme/fullpapers/1937/com1937.htm; also see update at www.research.digital.com/SRC/whatsnew/sem.html.
[17] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science and Technology, 51(12), pages 1114–1122, 2000. www.research.digital.com/SRC/personal/monika/papers/wows.ps.gz.
[18] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–111, August 1998. www.henzinger.com/monika/mpapers/sigir98_ps.ps.
[19] S. Blackmore. The Meme Machine. Oxford University Press, 1999.
[20] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. www.ics.uci.edu/~mlearn/MLRepository.html.
[21] W. J. Bluestein. Hypertext versions of journal articles: Computer aided linking and realistic human evaluation. Ph.D. thesis, University of Western Ontario, 1999.
[22] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. Computational Learning Theory, pages 92–100, 1998.
[23] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In 4th International Conference on Knowledge Discovery and Data Mining, August 1998. www.ece.nwu.edu/~harsha/Clustering/scaleKM.ps.
[24] B. E. Brewington and G. Cybenko. Keeping up with the changing Web. IEEE Computer, 33(5), pages 52–58, 2000.
[25] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, pages 152–155, 1992. www.cs.jhu.edu/~brill/acadpubs.html and citeseer.nj.nec.com/brill92simple.html.
[26] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998. decweb.ethz.ch/WWW7/1921/com1921.htm.
[27] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391–404, April 1997. Also appeared as SRC Technical Note 1997–015; see research.compaq.com/SRC/WebArcheology/syntactic.html.
[28] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In WWW9, pages 309–320, Amsterdam, May 2000. Elsevier Science. www9.org/w9cdrom/160/160.html.
[29] V. Bush. As we may think. The Atlantic Monthly, July 1945. www.theatlantic.com/unbound/flashbks/computer/bushf.htm.
[30] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pages 328–334, July 1999.
[31] C. Cardie and D. Pierce. The role of lexicalization and pruning for base noun phrase grammars. In AAAI-99, pages 423–430, July 1999.
[32] B. Carlin and T. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, 1996.
[33] J. Carriere and R. Kazman. WebQuery: Searching and visualizing the Web through connectivity. In WWW6, pages 701–711, 1997. www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html.
[34] N. Catenazzi and F. Gibb. The publishing process: The hyperbook approach. Journal of Information Science, 21(3), pages 161–172, 1995.
[35] S. Chakrabarti and Y. Batterywala. Mining themes from bookmarks. In ACM SIGKDD Workshop on Text Mining, Boston, August 2000. www.cse.iitb.ac.in/~soumen/doc/kdd2000/theme2.ps.
[36] S. Chakrabarti and B. E. Dom. Feature diffusion across hyperlinks. U.S. Patent No. 6,125,361, April 1998. IBM Corp.
[37] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, August 1998. www.cse.iitb.ac.in/~soumen/doc/vldbj1998/.
[38] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In 7th World Wide Web Conference (WWW7), 1998. www7.scu.edu.au/programme/fullpapers/1898/com1898.html.
[39] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD Conference. ACM, 1998. www.cse.iitb.ac.in/~soumen/doc/sigmod98/.
[40] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web’s link structure. IEEE Computer, 32(8), pages 60–67, August 1999.
[41] S. Chakrabarti, D. A. Gibson, and K. S. McCurley. Surfing the Web backwards. In WWW, vol. 8, Toronto, May 1999.
[42] S. Chakrabarti, M. M. Joshi, and V. B. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks. In SIGIR, vol. 24. ACM, New Orleans, September 2001.
[43] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online reference feedback. WWW, pages 148–159. ACM, Honolulu, May 2002. www2002.org/CDROM/refereed/336/index.html.
[44] S. Chakrabarti, S. Srivastava, M. Subramanyam, and M. Tiwari. Using Memex to archive and mine community Web browsing experience. Computer Networks, 33(1–6), pages 669–684, May 2000. www9.org/w9cdrom/98/98.html.
[45] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31, pages 1623–1640, 1999. First appeared in the 8th International World Wide Web Conference, Toronto, May 1999. www8.org/w8-papers/5a-search-query/crawling/.
[46] E. Charniak. A maximum-entropy-inspired parser. Computer Science Technical Report CS-99–12, Brown University, August 1999. www.cs.brown.edu/people/ec/.
[47] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996. ic.arc.nasa.gov/ic/projects/bayes-group/images/kdd-95.ps.
[48] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In 6th World Wide Web Conference, San Jose, CA, 1996.
[49] R. Chellappa and A. Jain. Markov random fields: Theory and applications. Academic Press, 1993.
[50] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In 7th World Wide Web Conference, Brisbane, Australia, April 1998. www7.scu.edu.au/programme/fullpapers/1919/com1919.htm.
[51] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated Web collections. In ACM International Conference on Management of Data (SIGMOD), May 2000. www-db.stanford.edu/~cho/papers/cho-mirror.pdf.
[52] E. Cohen and D. D. Lewis. Approximating matrix multiplication for pattern recognition tasks. Journal of Algorithms, 30, pages 211–252, 1999. Special issue of selected papers from SODA’97. www.research.att.com/~edith/publications.html.
[53] W. W. Cohen. Fast effective rule induction. In 12th International Conference on Machine Learning, Lake Tahoe, CA, 1995. www.research.att.com/~wcohen/postscript/ml-95-ripper.ps and www.research.att.com/~wcohen/ripperd.html.
[54] W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996.
[55] R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, V. Zue, G. B. Varile, A. Zampolli, et al., editors. Survey of the State of the Art in Human Language Technology. Cambridge University Press, National Science Foundation and European Commission, 1996. cslu.cse.ogi.edu/HLTsurvey/.
[56] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, second edition. McGraw-Hill, 2002.
[57] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[58] M. Craven, S. Slattery, and K. Nigam. First-order learning for (Web) mining. 10th European Conference on Machine Learning, pages 250–255, 1998. citeseer.nj.nec.com/craven98firstorder.html.
[59] D. R. Cutting, D. R. Karger, and J. O. Pedersen. Constant interaction-time scatter/gather browsing of very large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), 1993.
[60] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), Denmark, 1992.
[61] B. D. Davison. Topical locality in the Web. In Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 272–279. ACM, Athens, July 2000. www.cs.rutgers.edu/~davison/pubs/2000/sigir/.
[62] R. Dawkins. The Selfish Gene, second ed. Oxford University Press, 1989.
[63] P. M. E. De Bra and R. D. J. Post. Information retrieval in the World Wide Web: Making client-based searching feasible. In Proceedings of the 1st International World Wide Web Conference, Geneva, 1994. www1.cern.ch/PapersWWW94/reinpost.ps.
[64] P. M. E. De Bra and R. D. J. Post. Searching for arbitrary information in the WWW: The fish search for Mosaic. In 2nd World Wide Web Conference ’94: Mosaic and the Web, Chicago, October 1994. archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/debra/article.html and citeseer.nj.nec.com/172936.html.
[65] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. 8th World Wide Web Conference, Toronto, May 1999.
[66] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pages 391–407, 1990. superbook.telcordia.com/~remde/lsi/papers/JASIS90.ps.
[67] S. J. DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1), pages 31–39, 1988.
[68] M. Dewey. Dewey Decimal Classification and Relative Index, 16th edition. Forest Press, 1958.
[69] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, Proceedings of 26th International Conference on Very Large Data Bases (VLDB), September 10–14, 2000, Cairo, pages 527–534. Morgan Kaufmann, 2000. www.neci.nec.com/~lawrence/papers/focus-vldb00/focus-vldb00.pdf.
[70] S. Dill, S. R. Kumar, K. S. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the Web. In VLDB, pages 69–78, Rome, September 2001. www.almaden.ibm.com/cs/k53/fractal.ps.
[71] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, pages 103–130, 1997.
[72] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[73] S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), pages 21–23, July 1998.
[74] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In 7th Conference on Information and Knowledge Management, 1998. www.research.microsoft.com/~jplatt/cikm98.pdf.
[75] T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), pages 61–74, 1993.
[76] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163–174, San Jose, CA, 1995.
[77] G. W. Flake, S. Lawrence, C. Lee Giles, and F. M. Coetzee. Self-organization and identification of Web communities. IEEE Computer, 35(3), pages 66–71, 2002. www.neci.nec.com/~lawrence/papers/web-computer02/bib.html.
[78] R. Flesch. A new readability yardstick. Journal of Applied Psychology, 32, pages 221–233, 1948.
[79] D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword searches into XML query processing. In WWW, vol. 9, pages 119–135, Amsterdam, May 2000. Elsevier Science. www9.org/w9cdrom/324/324.html.
[80] S. Fong and R. Berwick. Parsing with Principles and Parameters. The MIT Press, 1992.
[81] G. D. Forney, Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3), pages 263–278, March 1973.
[82] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
[83] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, B 44, pages 355–362, 1988.
[84] D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th National Conference on Artificial Intelligence, pages 517–523, 1998.
[85] J. H. Friedman. On bias, variance, 0/1 loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1), pages 55–77, 1997. Stanford University Technical Report, ftp://playfair.stanford.edu/pub/friedman/curse.ps.Z.
[86] N. Fuhr and C. Buckley. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3), pages 223–248, 1991.
[87] R. Garside and N. Smith. A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech, and A. McEnery, editors, Corpus Annotation: Linguistic Information from Computer Text Corpora, pages 102–121. Longman, 1997. www.comp.lancs.ac.uk/computing/research/ucrel/claws/.
[88] G. Gazdar and C. Mellish. Natural Language Processing in LISP. Addison-Wesley, 1989.
[89] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In ACM Conference on Hypertext, pages 225–234, 1998.
[90] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999. citeseer.nj.nec.com/gionis97similarity.html.
[91] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
[92] N. Govert, M. Lalmas, and N. Fuhr. A probabilistic description-oriented approach for categorizing Web documents. In CIKM, pages 475–482, 1999. citeseer.nj.nec.com/govert99probabilistic.html.
[93] S. G. Green. Building newspaper links in newspaper articles using semantic similarity. In Natural Language and Data Bases Conference, pages 178–190, 1997. citeseer.nj.nec.com/Stephen97building.html.
[94] S. Guiasu and A. Shenitzer. The principle of maximum entropy. The Mathematical Intelligencer, 7(1), pages 42–48, 1985.
[95] L. Haegeman. Introduction to Government and Binding Theory. Basil Blackwell Ltd., 1991.
[96] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[97] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, 2001.
[98] T. H. Haveliwala. Topic-sensitive PageRank. In WWW, pages 517–526. ACM, Honolulu, May 2002. www2002.org/CDROM/refereed/127/index.html.
[99] D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the TREC-8 Web track. In E. Voorhees and D. Harman, editors, Proceedings of the 8th Text REtrieval Conference (TREC-8), NIST Special Publication 500–246, pages 131–150, 2000.
[100] M. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994. www.sims.berkeley.edu/~hearst/publications.shtml.
[101] M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of the 20th Annual International ACM/SIGIR Conference, Philadelphia, July 1997. ftp://parcftp.xerox.com/pub/hearst/sigir97.ps.
[102] D. Heckerman. A tutorial on learning with Bayesian networks. In 12th International Conference on Machine Learning, Tahoe City, CA, July 1995. ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.PS and ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS.
[103] D. Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 1997. ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.PS and ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS.
[104] B. Hendrickson and R. W. Leland. A multi-level algorithm for partitioning graphs. In Supercomputing, 1995.
[105] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. In WWW9, Amsterdam, May 2000. www9.org/w9cdrom/88/88.html.
[106] U. Hermjakob and R. J. Mooney. Learning parse and translation decisions from examples with rich context. In P. R. Cohen and W. Wahlster, editors, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–489, Somerset, NJ, 1997.
[107] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. The shark-search algorithm—an application: Tailored Web site mapping. In WWW7, 1998. www7.scu.edu.au/programme/fullpapers/1849/com1849.htm.
[108] A. Heydon and M. Najork. Mercator: A scalable, extensible Web crawler. World Wide Web, 2(4), pages 219–229, 1999.
[109] T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, 1999. www.cs.brown.edu/people/th/publications.html.
[110] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999. www.cs.brown.edu/people/th/publications.html.
[111] T. Hofmann and J. Puzicha. Unsupervised learning from dyadic data. Technical Report TR-98–042, University of California, Berkeley, 1998.
[112] E. Hovy, L. Gerber, U. Hermjakob, M. Junk, and C.-Y. Lin. Question answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference (TREC-9). NIST, 2001. trec.nist.gov/pubs/trec9/papers/webclopedia.pdf.
[113] W. J. Hutchins and H. L. Somers. An Introduction to Machine Translation. Academic Press, 1992.
[114] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[115] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pages 361–379. University of California Press, 1961.
[116] E. T. Jaynes. Notes on present status and future prospects. In W. T. Grandy and L. H. Schick, editors, Maximum Entropy and Bayesian Methods, pages 1–13. Kluwer, 1990.
[117] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. The MIT Press, 1999. www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99a.pdf.
[118] T. Joachims. A statistical learning model of text classification for support vector machines. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, International Conference on Research and Development in Information Retrieval, vol. 24, pages 128–136. SIGIR, ACM, New Orleans, September 2001.
[119] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, no. 1398 in LNCS, pages 137–142, Chemnitz, Germany, 1998. Springer-Verlag.
[120] B. Kahle. Preserving the Internet. Scientific American, 276(3), pages 82–83, March 1997. www.sciam.com/0397issue/0397kahle.html and www.alexa.com/~brewster/essays/sciam_article.html.
[121] B. Katz. From sentence processing to information access on the World Wide Web. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, pages 77–94, Stanford, CA, 1997. Stanford University. www.ai.mit.edu/people/boris/webaccess/.
[122] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ10076(91892). www.cs.cornell.edu/home/kleinber/auth.ps.
[123] J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In ACM Symposium on Theory of Computing, pages 599–608, 1997.
[124] J. M. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In IEEE Symposium on Foundations of Computer Science, pages 14–23, 1999.
[125] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks (Special Issue on Neural Networks for Data Mining and Knowledge Discovery), 11(3), pages 574–585, May 2000. websom.hut.fi/websom/doc/publications.html.
[126] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In L. Saitta, editor, International Conference on Machine Learning, vol. 14. Morgan Kaufmann, 1997. robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.ps.
[127] D. Koller and M. Sahami. Toward optimal feature selection. In L. Saitta, editor, International Conference on Machine Learning, vol. 13. Morgan Kaufmann, 1996.
[128] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8/Computer Networks, 31(11–16), pages 1481–1493, 1999. www8.org/w8-papers/4a-search-mining/trawling/trawling.html.
[129] K. S. Kumarvel. Automatic hypertext creation. M. Tech thesis, Computer Science and Engineering Department, IIT Bombay, 1997.
[130] J. Kupiec. MURAX: A robust linguistic approach for question answering using an online encyclopedia. In R. Korfhage, E. M. Rasmussen, and P. Willett, editors, SIGIR, pages 181–190. ACM, 1993.
[131] C. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the Web. In WWW, vol. 10, pages 150–161, Hong Kong, May 2001. IW3C2 and ACM. www10.org/cdrom/papers/120/.
[132] R. Larson. Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society for Information Science, 1996. sherlock.berkeley.edu/asis96/asis96.html.
[133] S. Lawrence and C. Lee Giles. Accessibility of information on the Web. Nature, 400, pages 107–109, July 1999.
[134] S. Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280, pages 98–100, April 1998.
[135] R. Lempel and S. Moran. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS), 19(2), pages 131–160, April 2001. www.cs.technion.ac.il/~moranlr/PS/lm-feb01.ps.
[136] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), pages 32–38, November 1995. www.cyc.com/ and www.opencyc.org/.
[137] D. D. Lewis. Evaluating text categorization. In Proceedings of the Speech and Natural Language Workshop, pages 312–318. Morgan Kaufmann, 1991.
[138] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In C. Nédellec and C. Rouveirol, editors, 10th European Conference on Machine Learning, pages 4–15, Chemnitz, Germany, April 1998. Springer.
[139] D. D. Lewis. The Reuters-21578 text categorization test collection, 1997. www.research.att.com/~lewis/reuters21578.html.
[140] W.-S. Li, Q. Vu, D. Agrawal, Y. Hara, and H. Takano. PowerBookmarks: A system for personalizable Web information organization, sharing and management. Computer Networks, 31, May 1999. First appeared in the 8th International World Wide Web Conference, Toronto, May 1999. www8.org/w8-papers/3b-web-doc/power/power.pdf.
[141] Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per content. In 5th International World Wide Web Conference, Paris, May 1996.
[142] S. Macskassy, A. Banerjee, B. Davison, and H. Hirsh. Human performance on clustering Web pages: A performance study. In Knowledge Discovery and Data Mining, vol. 4, pages 264–268, 1998.
[143] O. A. McBryan. GENVL and WWWW: Tools for taming the Web. In Proceedings of the First International World Wide Web Conference, pages 79–90, 1994.
[144] K. W. McCain. Core journal networks and cocitation maps in the marine sciences: Tools for information management in interdisciplinary research. In D. Shaw, editor, ASIS’92: Proceedings of the 55th ASIS Annual Meeting, pages 3–7, Medford, NJ, 1992. American Society for Information Science.
[145] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998. Also Technical Report WS-98-05, CMU. www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf.
[146] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In AAAI-99 Spring Symposium, 1999. www.cs.cmu.edu/~mccallum/papers/cora-aaaiss99.ps.gz.
[147] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In 15th International Conference on Machine Learning, pages 350–367, 1998. www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz.
[148] F. Menczer. Links tell us about lexical and semantic Web content. Technical Report Computer Science Abstract CS.IR/0108004, arXiv.org, August 2001. arxiv.org/abs/cs.IR/0108004.
[149] F. Menczer and R. K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2/3), pages 203–242, 2000. Longer version available as Technical Report CS98-579, University of California, San Diego. dollar.biz.uiowa.edu/~fil/Papers/MLJ.ps.
[150] D. Meretakis and B. Wuthrich. Extending naive Bayes classifiers using long itemsets. In 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 1999.
[151] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Princeton University, August 1993. ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.pdf.
[152] T. M. Mitchell. Conditions for the equivalence of hierarchical and flat Bayesian classifiers. Technical note, 1998. www.cs.cmu.edu/~tom/hierproof.ps.
[153] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[154] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proceedings of RIAO-97, 5th International Conference “Recherche d’Information Assistee par Ordinateur,” pages 200–214, Montreal, Quebec, 1997.
[155] D. Mladenic. Feature subset selection in text-learning. In 10th European Conference on Machine Learning, vol. 1398, pages 95–100, 1998.
[156] F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, 1964.
[157] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[158] M. Najork and J. Wiener. Breadth-first search crawling yields high-quality pages. In WWW 10, Hong Kong, May 2001. www10.org/cdrom/papers/208.
[159] National Archives and Records Administration. Using the census soundex. General information leaflet 55. Washington, DC, 1995. Free brochure available from [email protected].
[160] T. Nelson. A file structure for the complex, the changing, and the indeterminate. In Proceedings of the ACM National Conference, pages 84–100, 1965.
[161] T. H. Nelson. Literary Machines. Mindful Press, 1982.
[162] A. Ng, A. Zheng, and M. Jordan. Stable algorithms for link analysis. In 24th Annual International ACM SIGIR Conference. ACM, New Orleans, September 2001. www.cs.berkeley.edu/~ang/.
[163] H. T. Ng and H. B. Lee. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In A. Joshi and M. Palmer, editors, Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, San Francisco, pages 40–47. Morgan Kaufmann, 1996.
[164] B. Nichols, D. Buttlar, and J. P. Farrell. Pthreads Programming. O’Reilly and Associates, 1996.
[165] J. Nielsen. Multimedia and Hypertext: The Internet and Beyond. Morgan Kaufmann, 1995. (Originally published by AP Professional.)
[166] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In 9th International Conference on Information and Knowledge Management (CIKM), 2000. www.cs.cmu.edu/~knigam.
[167] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI’99 Workshop on Information Filtering, 1999. www.cs.cmu.edu/~mccallum/papers/maxent-ijcaiws99.ps.gz.
[168] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), pages 103–134, 2000. www-2.cs.cmu.edu/~mccallum/papers/emcat-mlj2000.ps.gz.
[169] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished manuscript, google.stanford.edu/~backrub/pageranksub.ps, 1998.
[170] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. JCSS, 61(2), pages 217–235, 2000. A preliminary version appeared in ACM PODS, pages 159–168, 1998.
[171] L. Pelkowitz. A continuous relaxation labeling algorithm for Markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3), pages 709–715, May 1990.
[172] D. M. Pennock, G. W. Flake, S. Lawrence, C. L. Giles, and E. J. Glover. Winners don’t take all: Characterizing the competition for links on the Web. Proceedings of the National Academy of Sciences, 2002. Preprint available: www.neci.nec.com/homepages/dpennock/publications.html.
[173] S. D. Pietra, V. D. Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), pages 380–393, April 1997.
[174] P. Pirolli, J. Pitkow, and R. Rao. Silk from a Sow’s Ear: Extracting Usable Structures from the Web. In ACM CHI, 1996.
[175] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. The MIT Press, 1999. research.microsoft.com/~jplatt/SVMprob.ps.gz.
[176] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998. www.research.microsoft.com/users/jplatt/smoTR.pdf.
[177] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281. ACM, 1998. cobar.cs.umass.edu/pubfiles/ir-120.ps.
[178] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI-2001), Seattle, WA, August 2001, pages 437–444. www.neci.nec.com/homepages/dpennock/publications.html.
[179] M. F. Porter. An algorithm for suffix stripping. Program, 14(3), pages 130–137, 1980.
[180] E. Rasmussen. Clustering algorithms. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, Chap. 16. Prentice Hall, 1992.
[181] J. Rennie and A. McCallum. Using reinforcement learning to spider the Web efficiently. In 16th International Conference on Machine Learning, pages 335–343, 1999. www.cs.cmu.edu/~mccallum/papers/rlspider-icml99s.ps.gz.
[182] J. Rissanen. Stochastic complexity in statistical inquiry. In World Scientific Series in Computer Science, vol. 15. World Scientific, 1989.
[183] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994.
[184] M. Sahami, M. Hearst, and E. Saund. Applying the multiple cause mixture model to text categorization. In L. Saitta, editor, International Conference on Machine Learning, vol. 13, pages 435–443. Morgan Kaufmann, 1996. robotics.stanford.edu/users/sahami/papers-dir/ml96-mcmm.ps.
[185] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.
[186] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[187] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), pages 51–71, 1995.
[188] J. Savoy. An extended vector processing scheme for searching information in hypertext systems. Information Processing and Management, 32(2), pages 155–170, March 1996.
[189] R. G. Schank and C. J. Rieger. Inference and computer understanding of natural language. In R. J. Brachman and H. J. Levesque, editors, Readings in Knowledge Representation. Morgan Kaufmann, 1985.
[190] B. Schölkopf and A. Smola. Learning with Kernels. The MIT Press, 2002.
[191] H. Schütze and C. Silverstein. A comparison of projections for efficient document clustering. In Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74–81, July 1997. www-cs-students.stanford.edu/~csilvers/papers/metrics-sigir.ps.
[192] J. R. Seeley. The net of reciprocal influence: A problem in treating sociometric data. Canadian Journal of Psychology, 3, pages 234–240, 1949.
[193] K. Seymore, A. McCallum, and R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 37–42, 1999. www-2.cs.cmu.edu/~kseymore/papers/ie_aaai99.ps.gz.
[194] A. Shashua. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Processing Letters, 9(2), pages 129–139, 1999. www.cs.huji.ac.il/~shashua/papers/fisher-NPL.pdf.
[195] A. Singhal and M. Kaszkiel. A case study in Web search using TREC algorithms. In WWW 10, Hong Kong, May 2001. www10.org/cdrom/papers/317.
[196] D. Sleator and D. Temperley. Parsing English with a link grammar. Computer Science Technical Report CMU-CS-91-196, Carnegie Mellon University, October 1991. www.link.cs.cmu.edu/link/papers/index.html.
[197] P. Smyth. Clustering using Monte Carlo cross-validation. In Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 126–133, Portland, OR, August 1996. AAAI Press.
[198] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1–3), pages 233–272, 1999. www.cs.washington.edu/homes/soderlan/WHISK.ps.
[199] J. F. Sowa. Conceptual Structures: Information Processing in Mind and Machines. Addison-Wesley, 1984.
[200] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: Development and comparative experiments. Information Processing and Management, 36(1–2): part 1, pages 779–808, and part 2, pages 809–840, 2000.
[201] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pages 197–206. University of California Press, 1955.
[202] W. R. Stevens. TCP/IP Illustrated: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols, vol. 3. Addison-Wesley, 1996.
[203] D. Temperley. An introduction to link grammar parser. Technical report, April 1999. www.link.cs.cmu.edu/link/dict/introduction.html.
[204] H. R. Turtle and W. B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), pages 187–222, 1991.
[205] UNU Institute of Advanced Studies. The Universal Networking Language: Specification document. Internal Technical Document, 1999. www.unl.ias.unu.edu/.
[206] H. Uchida, M. Zhu, and T. D. Senta. The UNL: A gift for a millennium. Institute of Advanced Studies, United Nations University, pages 53–67, Tokyo, November 1999. www.unl.ias.unu.edu/.
[207] S. Vaithyanathan and B. Dom. Generalized model selection for unsupervised learning in high dimensions. In Neural Information Processing Systems (NIPS), Denver, CO, 1999. www.almaden.ibm.com/cs/k53/papers/nips99.ps.
[208] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11), pages 1134–1142, 1984.
[209] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[210] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[211] R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the 7th ACM Conference on Hypertext, Washington, DC, March 1996.
[212] S. Weiss and N. Indurkhya. Optimized rule induction. IEEE Expert, 8(6), pages 61–69, 1993.
[213] P. Willett. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24(5), 1988.
[214] T. Winograd. Language as a Cognitive Process, Vol. 1: Syntax. Addison-Wesley, 1983.
[215] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Multimedia Information and Systems. Morgan Kaufmann, 1999.
[216] XTAG Research Group. A lexicalized tree adjoining grammar for English. Technical Report IRCS-01-03. IRCS, University of Pennsylvania, 2001.
[217] Y. Yang and X. Liu. A re-examination of text categorization methods. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), pages 42–49. ACM, 1999. www-2.cs.cmu.edu/~yiming/publications.html.
[218] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997.
[219] J. Yi and N. Sundaresan. A classifier for semi-structured documents. In KDD 2000, pages 340–344. ACM SIGKDD, Boston, August 2000.
[220] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In SIGMOD, pages 103–114. ACM, 1996. www.ece.nwu.edu/~harsha/Clustering/sigmodpaper.ps.