References
To make references as accessible as possible, we have included URLs wherever we could. Unfortunately, over time, Web sites disappear, accounts are reorganized, and hyperlinks break. If you find a broken link, you may be able to find the paper by pasting the title and author names into a general search engine such as www.google.com or an academic paper collection such as citeseer.com, or by pasting the broken URL into the Wayback Machine at www.archive.org.
[3] J. Allan. Automatic hypertext link typing. In 7th ACM Conference on Hypertext, Hypertext ‘96, pages 42–51, 1996.
[4] J. Allen. Natural Language Understanding. Benjamin Cummings, 1987, 1995.
[5] B. Amento, L. G. Terveen, and W. C. Hill. Does “authority” mean quality? Predicting expert quality ratings of Web documents. In SIGIR, pages 296–303. ACM, 2000. citeseer.nj.nec.com/417258.html.
[6] E. L. Antworth. PC-KIMMO: A two-level processor for morphological analysis. Summer Institute of Linguistics, International Academic Bookstore, Dallas, 1990. www.sil.org/pckimmo/pc-kimmo.html.
[7] C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 1994. Also published as IBM Research Report RC18879.
[8] D. J. Arnold, L. Balkan, R. L. Humphreys, S. Meijer, and L. Sadler. Machine translation: An introductory guide, 1995. clwww.essex.ac.uk/~doug/book/book.html, clwww.essex.ac.uk/MTbook/, www.essex.ac.uk/linguistics/clmt/MTbook/.
[9] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis for data mining. In STOC, vol. 33, pages 619–626, 2001.
[11] F. Bacon. The Advancement of Learning. Clarendon Press, 1873.
[12] Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 535–544, 2000. www.cs.berkeley.edu/~zivi/papers/webwalker/webwalker.ps.gz.
[13] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286, pages 509–512, 1999.
[14] A. Berg. Random jumps in Web Walker. Personal communication, April 2001.
[18] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–111, August 1998. www.henzinger.com/monika/mpapers/sigir98_ps.ps.
[19] S. Blackmore. The Meme Machine. Oxford University Press, 1999.
[21] W.J. Bluestein. Hypertext versions of journal articles: Computer aided linking and realistic human evaluation. Ph.D. thesis, University of Western Ontario, 1999.
[22] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. Computational Learning Theory, pages 92–100, 1998.
[24] B. E. Brewington and G. Cybenko. Keeping up with the changing Web. IEEE Computer, 33(5), pages 52–58, 2000.
[25] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, pages 152–155, 1992. www.cs.jhu.edu/~brill/acadpubs.html and citeseer.nj.nec.com/brill92simple.html.
[26] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998. decweb.ethz.ch/WWW7/1921/com1921.htm.
[27] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391–404, April 1997. Also appeared as SRC Technical Note 1997–015; see research.compaq.com/SRC/WebArcheology/syntactic.html.
[28] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In WWW9, pages 309–320, Amsterdam, May 2000. Elsevier Science. www9.org/w9cdrom/160/160.html.
[30] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pages 328–334, July 1999.
[31] C. Cardie and D. Pierce. The role of lexicalization and pruning for base noun phrase grammars. In AAAI 99, pages 423–430, July 1999.
[32] B. Carlin and T. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, 1996.
[34] N. Catenazzi and F. Gibb. The publishing process: The hyperbook approach. Journal of Information Science, 21(3), pages 161–172, 1995.
[36] S. Chakrabarti and B. E. Dom. Feature diffusion across hyperlinks. U.S. Patent No. 6,125,361, April 1998. IBM Corp.
[37] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, August 1998. www.cse.iitb.ac.in/~soumen/doc/vldbj1998/.
[38] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In 7th World Wide Web Conference (WWW7), 1998. www7.scu.edu.au/programme/fullpapers/1898/com1898.html.
[40] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web’s link structure. IEEE Computer, 32(8), pages 60–67, August 1999.
[41] S. Chakrabarti, D. A. Gibson, and K. S. McCurley. Surfing the Web backwards. In WWW, vol 8, Toronto, May 1999.
[42] S. Chakrabarti, M. M. Joshi, and V. B. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks. In SIGIR, vol. 24. ACM, New Orleans, September 2001.
[44] S. Chakrabarti, S. Srivastava, M. Subramanyam, and M. Tiwari. Using Memex to archive and mine community Web browsing experience. Computer Networks, 33(1–6), pages 669–684, May 2000. www9.org/w9cdrom/98/98.html.
[45] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31, pages 1623–1640, 1999. First appeared in the 8th International World Wide Web Conference, Toronto, May 1999. www8.org/w8-papers/5a-search-query/crawling/.
[46] E. Charniak. A maximum-entropy-inspired parser. Computer Science Technical Report CS-99–12, Brown University, August 1999. www.cs.brown.edu/people/ec/.
[47] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996. ic.arc.nasa.gov/ic/projects/bayes-group/images/kdd-95.ps.
[48] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In 6th World Wide Web Conference, San Jose, CA, 1996.
[49] R. Chellappa and A. Jain. Markov random fields: Theory and applications. Academic Press, 1993.
[52] E. Cohen and D. D. Lewis. Approximating matrix multiplication for pattern recognition tasks. Journal of Algorithms, 30, pages 211–252, 1999. Special issue of selected papers from SODA’97. www.research.att.com/~edith/publications.html.
[54] W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996.
[55] R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, V. Zue, G. B. Varile, A. Zampolli, et al., editors. Survey of the State of the Art in Human Language Technology. Cambridge University Press, National Science Foundation and European Commission, 1996. cslu.cse.ogi.edu/HLTsurvey/.
[56] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, second edition. McGraw-Hill, 2002.
[57] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[58] M. Craven, S. Slattery, and K. Nigam. First-order learning for (Web) mining. 10th European Conference on Machine Learning, pages 250–255, 1998. citeseer.nj.nec.com/craven98firstorder.html.
[59] D. R. Cutting, D. R. Karger, and J. O. Pedersen. Constant interaction-time scatter/gather browsing of very large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), 1993.
[60] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), Denmark, 1992.
[61] B. D. Davison. Topical locality in the Web. In Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 272–279. ACM, Athens, July 2000. www.cs.rutgers.edu/~davison/pubs/2000/sigir/.
[62] R. Dawkins. The Selfish Gene, second ed. Oxford University Press, 1989.
[63] P. M. E. De Bra and R. D. J. Post. Information retrieval in the World Wide Web: Making client-based searching feasible. In Proceedings of the 1st International World Wide Web Conference, Geneva, 1994. www1.cern.ch/PapersWWW94/reinpost.ps.
[64] P. M. E. De Bra and R. D. J. Post. Searching for arbitrary information in the WWW: The fish search for Mosaic. In 2nd World Wide Web Conference ‘94: Mosaic and the Web, Chicago, October 1994. archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/debra/article.html and citeseer.nj.nec.com/172936.html.
[65] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. In 8th World Wide Web Conference, Toronto, May 1999.
[66] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pages 391–407, 1990. superbook.telcordia.com/~remde/lsi/papers/JASIS90.ps.
[67] S. J. DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1), pages 31–39, 1988.
[68] M. Dewey. Dewey Decimal Classification and Relative Index, 16th edition. Forest Press, 1958.
[69] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, Proceedings of 26th International Conference on Very Large Data Bases (VLDB), September 10–14, 2000, Cairo, pages 527–534. Morgan Kaufmann, 2000. www.neci.nec.com/~lawrence/papers/focus-vldb00/focus-vldb00.pdf.
[70] S. Dill, S. R. Kumar, K. S. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the Web. In VLDB, pages 69–78, Rome, September 2001. www.almaden.ibm.com/cs/k53/fractal.ps.
[71] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, pages 103–130, 1997.
[72] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[73] S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), pages 21–23, July 1998.
[74] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In 7th Conference on Information and Knowledge Management, 1998. www.research.microsoft.com/~jplatt/cikm98.pdf.
[75] T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), pages 61–74, 1993.
[76] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163–174, San Jose, CA, 1995.
[78] R. Flesch. A new readability yardstick. Journal of Applied Psychology, 32, pages 221–233, 1948.
[79] D. Florescu, D. Kossman, and I. Manolescu. Integrating keyword searches into XML query processing. In WWW, vol. 9, pages 119–135, Amsterdam, May 2000. Elsevier Science. www9.org/w9cdrom/324/324.html.
[80] S. Fong and R. Berwick. Parsing with Principles and Parameters. The MIT Press, 1992.
[81] G. D. Forney, Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3), pages 263–278, March 1973.
[82] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
[83] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, B 44, pages 355–362, 1988.
[84] D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th National Conference on Artificial Intelligence, pages 517–523, 1998.
[85] J. H. Friedman. On bias, variance, 0/1 loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1), pages 55–77, 1997. Stanford University Technical Report, ftp://playfair.stanford.edu/pub/friedman/curse.ps.Z.
[86] N. Fuhr and C. Buckley. A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems, 9(3), pages 223–248, 1991.
[87] R. Garside and N. Smith. A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech, and A. McEnery, editors, Corpus Annotation: Linguistic Information from Computer Text Corpora, pages 102–121. Longman, 1997. www.comp.lancs.ac.uk/computing/research/ucrel/claws/.
[88] G. Gazdar and C. Mellish. Natural Language Processing in LISP. Addison-Wesley, 1989.
[89] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In ACM Conference on Hypertext, pages 225–234, 1998.
[90] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999. citeseer.nj.nec.com/gionis97similarity.html.
[91] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
[92] N. Govert, M. Lalmas, and N. Fuhr. A Probabilistic Description-Oriented Approach for Categorizing Web Documents. In CIKM, pages 475–482, 1999. citeseer.nj.nec.com/govert99probabilistic.html.
[93] S. G. Green. Building newspaper links in newspaper articles using semantic similarity. In Natural Language and Data Bases Conference, pages 178–190, 1997. citeseer.nj.nec.com/Stephen97building.html.
[94] S. Guiasu and A. Shenitzer. The principle of maximum entropy. The Mathematical Intelligencer, 7(1), pages 42–48, 1985.
[95] L. Haegeman. Introduction to Government and Binding Theory. Basil Blackwell Ltd., 1991.
[96] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[97] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, 2001.
[99] D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the TREC-8 Web track. In E. Voorhees and D. Harman, editors, Proceedings of the 8th Text REtrieval Conference (TREC-8), NIST Special Publication 500–246, pages 131–150, 2000.
[101] M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of the 20th Annual International ACM/SIGIR Conference, Philadelphia, July 1997. ftp://parcftp.xerox.com/pub/hearst/sigir97.ps.
[102] D. Heckerman. A tutorial on learning with Bayesian networks. In 12th International Conference on Machine Learning, Tahoe City, CA, July 1995. ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.PS and ftp://ftp.research.microsoft.com/pub/tr/TR-95–06.PS.
[103] D. Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 1997. ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.PS and ftp://ftp.research.microsoft.com/pub/tr/TR-95–06.PS.
[104] B. Hendrickson and R. W. Leland. A multi-level algorithm for partitioning graphs. In Supercomputing, 1995.
[105] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. In WWW9, Amsterdam, May 2000. www9.org/w9cdrom/88/88.html.
[106] U. Hermjakob and R. J. Mooney. Learning parse and translation decisions from examples with rich context. In P. R. Cohen and W. Wahlster, editors, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–489, Somerset, NJ, 1997.
[108] A. Heydon and M. Najork. Mercator: A scalable, extensible Web crawler. World Wide Web Conference, 2(4), pages 219–229, 1999.
[111] T. Hofmann and J. Puzicha. Unsupervised learning from dyadic data. Technical Report TR-98–042, University of California, Berkeley, 1998.
[112] E. Hovy, L. Gerber, U. Hermjakob, M. Junk, and C.-Y. Lin. Question answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference (TREC-9). NIST, 2001. trec.nist.gov/pubs/trec9/papers/webclopedia.pdf.
[113] W.J. Hutchins and H. L. Somers. An Introduction to Machine Translation. Academic Press, 1992.
[114] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[115] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pages 361–379. University of California Press, 1961.
[116] E. T. Jaynes. Notes on present status and future prospects. In W. T. Grandy and L. H. Schick, editors, Maximum Entropy and Bayesian Methods, pages 1–13. Kluwer, 1990.
[117] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. The MIT Press, 1999. www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99a.pdf.
[118] T. Joachims. A statistical learning model of text classification for support vector machines. In W. B. Croft, D.J. Harper, D. H. Kraft, and J. Zobel, editors, International Conference on Research and Development in Information Retrieval, vol. 24, pages 128–136. SIGIR, ACM, New Orleans, September 2001.
[119] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, no. 1398 in LNCS, pages 137–142, Chemnitz, Germany, 1998. Springer-Verlag.
[121] B. Katz. From sentence processing to information access on the World Wide Web. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, pages 77–94, Stanford, CA, 1997. Stanford University. www.ai.mit.edu/people/boris/webaccess/.
[122] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ10076(91892). www.cs.cornell.edu/home/kleinber/auth.ps.
[123] J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In ACM Symposium on Theory of Computing, pages 599–608, 1997.
[124] J. M. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In IEEE Symposium on Foundations of Computer Science, pages 14–23, 1999.
[125] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks (Special Issue on Neural Networks for Data Mining and Knowledge Discovery), 11(3), pages 574–585, May 2000. websom.hut.fi/websom/doc/publications.html.
[126] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In L. Saitta, editor, International Conference on Machine Learning, vol. 14. Morgan Kaufmann, 1997. robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.ps.
[127] D. Koller and M. Sahami. Toward optimal feature selection. In L. Saitta, editor, International Conference on Machine Learning, vol. 13. Morgan Kaufmann, 1996.
[129] K. S. Kumarvel. Automatic hypertext creation. M. Tech thesis, Computer Science and Engineering Department, IIT Bombay, 1997.
[130] J. Kupiec. MURAX: A robust linguistic approach for question answering using an online encyclopedia. In R. Korfhage, E. M. Rasmussen, and P. Willett, editors, SIGIR, pages 181–190. ACM, 1993.
[131] C. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the Web. In WWW, vol. 10, pages 150–161, Hong Kong, May 2001. IW3C2 and ACM. www10.org/cdrom/papers/120/.
[132] R. Larson. Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society for Information Science, 1996. sherlock.berkeley.edu/asis96/asis96.html.
[133] S. Lawrence and C. Lee Giles. Accessibility of information on the Web. Nature, 400, pages 107–109, July 1999.
[134] S. Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280, pages 98–100, April 1998.
[136] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), pages 32–38, November 1995. www.cyc.com/ and www.opencyc.org/.
[137] D. D. Lewis. Evaluating text categorization. In Proceedings of the Speech and Natural Language Workshop, pages 312–318. Morgan Kaufmann, 1991.
[138] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In C. Nedellec and C. Rouveirol, editors, 10th European Conference on Machine Learning, pages 4–15, Chemnitz, Germany, April 1998. Springer.
[140] W.-S. Li, Q. Vu, D. Agrawal, Y. Hara, and H. Takano. PowerBookmarks: A system for personalizable Web information organization, sharing and management. Computer Networks, 31, May 1999. First appeared in the 8th International World Wide Web Conference, Toronto, May 1999. www8.org/w8-papers/3b-web-doc/power/power.pdf.
[141] Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per content. In 5th International World Wide Web Conference, Paris, May 1996.
[142] S. Macskassy, A. Banerjee, B. Davidson, and H. Hirsh. Human performance on clustering Web pages: A performance study. In Knowledge Discovery and Data Mining, vol. 4, pages 264–268, 1998.
[143] O. A. McBryan. GENVL and WWWW: Tools for taming the Web. In Proceedings of the First International World Wide Web Conference, pages 79–90, 1994.
[144] K. W. McCain. Core journal networks and cocitation maps in the marine sciences: Tools for information management in interdisciplinary research. In D. Shaw, editor, ASIS’92: Proceedings of the 55th ASIS Annual Meeting, pages 3–7, Medford, NJ, 1992. American Society for Information Science.
[145] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998. Also Technical Report WS-98-05, CMU. www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf.
[147] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In 15th International Conference on Machine Learning, pages 350–367, 1998. www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz.
[148] F. Menczer. Links tell us about lexical and semantic Web content. Technical Report Computer Science Abstract CS.IR/0108004, arXiv.org, August 2001. arxiv.org/abs/cs.IR/0108004.
[149] F. Menczer and R. K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2/3), pages 203–242, 2000. Longer version available as Technical Report CS98-579, University of California, San Diego, dollar.biz.uiowa.edu/~fil/Papers/MLJ.ps.
[150] D. Meretakis and B. Wuthrich. Extending naive Bayes classifiers using long itemsets. In 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 1999.
[151] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Princeton University, August 1993. ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.pdf.
[153] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[154] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proceedings of RIAO-97, 5th International Conference “Recherche d’Information Assistee par Ordinateur,” pages 200–214, Montreal, Quebec, 1997.
[155] D. Mladenic. Feature subset selection in text-learning. In 10th European Conference on Machine Learning, vol. 1398, pages 95–100, 1998.
[156] F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, 1964.
[157] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[158] M. Najork and J. Wiener. Breadth-first search crawling yields high-quality pages. In WWW 10, Hong Kong, May 2001. www10.org/cdrom/papers/208.
[159] National Archives and Records Administration. Using the census soundex. General information leaflet 55. Washington, DC, 1995. Free brochure available from
[email protected].
[160] T. Nelson. A file structure for the complex, the changing, and the indeterminate. In Proceedings of the ACM National Conference, pages 84–100, 1965.
[161] T. H. Nelson. Literary Machines. Mindful Press, 1982.
[162] A. Ng, A. Zheng, and M. Jordan. Stable algorithms for link analysis. In 24th Annual International ACM SIGIR Conference. ACM, New Orleans, September 2001. www.cs.berkeley.edu/~ang/.
[163] H. T. Ng and H. B. Lee. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In A. Joshi and M. Palmer, editors, Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, San Francisco, pages 40–47. Morgan Kaufmann, 1996.
[164] B. Nichols, D. Buttlar, and J. P. Farrell. Pthreads Programming. O’Reilly and Associates, 1996.
[165] J. Nielsen. Multimedia and Hypertext: The Internet and Beyond. Morgan Kaufmann, 1995. (Originally published by AP Professional.)
[166] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In 9th International Conference on Information and Knowledge Management (CIKM), 2000. www.cs.cmu.edu/~knigam.
[169] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished manuscript, google.stanford.edu/~backrub/pageranksub.ps, 1998.
[170] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. JCSS, 61(2), pages 217–235, 2000. A preliminary version appeared in ACM PODS, pages 159–168, 1998.
[171] L. Pelkowitz. A continuous relaxation labeling algorithm for Markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3), pages 709–715, May 1990.
[172] D. M. Pennock, G. W. Flake, S. Lawrence, C. L. Giles, and E. J. Glover. Winners don’t take all: Characterizing the competition for links on the Web. Proceedings of the National Academy of Sciences, 2002. Preprint available: www.neci.nec.com/homepages/dpennock/publications.html.
[173] S. D. Pietra, V. D. Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), pages 380–393, April 1997.
[174] P. Pirolli, J. Pitkow, and R. Rao. Silk from a Sow’s Ear: Extracting Usable Structures from the Web. In ACM CHI, 1996.
[175] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. The MIT Press, 1999. research.microsoft.com/~jplatt/SVMprob.ps.gz.
[177] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281. ACM, 1998. cobar.cs.umass.edu/pubfiles/ir-120.ps.
[178] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI-2001), Seattle, WA, August 2001, pages 437–444. www.neci.nec.com/homepages/dpennock/publications.html.
[179] M. F. Porter. An algorithm for suffix stripping. Program, 14(3), pages 130–137, 1980.
[180] E. Rasmussen. Clustering algorithms. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, Chap. 16. Prentice Hall, 1992.
[182] J. Rissanen. Stochastic complexity in statistical inquiry. In World Scientific Series in Computer Science, vol. 15. World Scientific, 1989.
[183] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994.
[184] M. Sahami, M. Hearst, and E. Saund. Applying the multiple cause mixture model to text categorization. In L. Saitta, editor, International Conference on Machine Learning, vol. 13, pages 435–443. Morgan Kaufmann, 1996. robotics.stanford.edu/users/sahami/papers-dir/ml96-mcmm.ps.
[185] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.
[186] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[187] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), pages 51–71, 1995.
[188] J. Savoy. An extended vector processing scheme for searching information in hypertext systems. Information Processing and Management, 32(2), pages 155–170, March 1996.
[189] R. G. Schank and C. J. Rieger. Inference and computer understanding of natural language. In R. J. Brachman and H. J. Levesque, editors, Readings in Knowledge Representation. Morgan Kaufmann, 1985.
[190] B. Schölkopf and A. Smola. Learning with Kernels. The MIT Press, 2002.
[192] J. R. Seeley. The net of reciprocal influence: A problem in treating sociometric data. Canadian Journal of Psychology, 3, pages 234–240, 1949.
[193] K. Seymore, A. McCallum, and R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 37–42, 1999. www-2.cs.cmu.edu/~kseymore/papers/ie_aaai99.ps.gz.
[195] A. Singhal and M. Kaszkiel. A case study in Web search using TREC algorithms. In WWW 10, Hong Kong, May 2001. www10.org/cdrom/papers/317.
[197] P. Smyth. Clustering using Monte Carlo cross-validation. In Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 126–133, Portland, OR, August 1996. AAAI Press.
[199] J. F. Sowa. Conceptual Structures: Information Processing in Mind and Machines. Addison-Wesley, 1984.
[200] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: Development and comparative experiments. Information Processing and Management, 36(1–2): 1, pages 779–808, and 2, pages 809–840, 2000.
[201] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pages 197–206. University of California Press, 1955.
[202] W. R. Stevens. TCP/IP Illustrated: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols, vol. 3. Addison-Wesley, 1996.
[204] H. R. Turtle and W. B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), pages 187–222, 1991.
[205] U. N. U. Institute of Advanced Studies. The Universal Networking Language: Specification document. Internal Technical Document, 1999. www.unl.ias.unu.edu/.
[206] H. Uchida, M. Zhu, and T. D. Senta. The UNL: A gift for a millennium. Institute of Advanced Studies, United Nations University, pages 53–67, Tokyo, November 1999. www.unl.ias.unu.edu/.
[207] S. Vaithyanathan and B. Dom. Generalized model selection for unsupervised learning in high dimensions. In Neural Information Processing Systems (NIPS), Denver, CO, 1999. www.almaden.ibm.com/cs/k53/papers/nips99.ps.
[208] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11), pages 1134–1142, 1984.
[209] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[210] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[211] R. Weiss, B. Velez, M. A. Sheldon, C. Nemprempre, P. Szilagyi, A. Duda, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the 7th ACM Conference on Hypertext, Washington, DC, March 1996.
[212] S. Weiss and N. Indurkhya. Optimized rule induction. IEEE Expert, 8(6), pages 61–69, 1993.
[213] P. Willett. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24(5), 1988.
[214] T. Winograd. Language as a Cognitive Process, Vol. 1: Syntax. Addison-Wesley, 1983.
[215] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Multimedia Information and Systems. Morgan Kaufmann, 1999.
[216] XTAG Research Group. A lexicalized tree adjoining grammar for English. Technical Report IRCS-01-03. IRCS, University of Pennsylvania, 2001.
[217] Y. Yang and X. Liu. A re-examination of text categorization methods. In Annual International Conference on Research and Development in Information Retrieval (SIGIR), pages 42–49. ACM, 1999. www-2.cs.cmu.edu/~yiming/publications.html.
[218] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997.
[219] J. Yi and N. Sundaresan. A classifier for semi-structured documents. In KDD 2000, pages 340–344. ACM SIGKDD, Boston, August 2000.