• Android Droid Surprise

    Updated: 2010-04-30 07:09:12
    Short honk: “Google Tells Verizon Users to Buy HTC Incredible over Nexus One” reported a surprising development. The passage I noted was: “When asked [by a reporter] for an elaboration on the word change, Google added the following: “We won’t be selling a Nexus One with Verizon, and this is a reflection of the amazing [...]

  • Now It Is the Collaborative Enterprise

    Updated: 2010-04-30 07:04:53
    A bit of clicking around surfaced “The Collaborative-Ready Enterprise.” The write up focuses on the benefits of “communication, coordination, communities and social interaction facilitation”. The idea is that video conferencing, instant messaging, and similar functions allow a business function like customer support to reduce cost and improve customer satisfaction. A conversation at breakfast this morning took a [...]

  • Oracle Acceleration with Sun Methods

    Updated: 2010-04-30 06:05:01
    Companies with big investments in Oracle face the same tough choices that bedeviled me when using Sun Microsystems hardware for The Point (Top 5% of the Internet) in 1993. To make Sun stuff go fast, one needed to keep the Sun system pure; that is, only Sun approved goodies were to be used. Each goodie [...]

  • Google and the Problematic MACs

    Updated: 2010-04-30 06:04:00
    The article “Google Defends Street View Wi-Fi Data Collection” has a killer passage. This is the segment that made it into my handwritten notes: Peter Fleischer, global privacy counsel for Google, countered this in a blog post, saying that the firm does not believe that collecting Wi-Fi network information is illegal. “This is all publicly broadcast [...]

  • Yahoo and Search Models

    Updated: 2010-04-30 06:01:03
    I received an email this morning pointing to the strong showing of Google at the recent Web conference in Raleigh, North Carolina. I responded that Yahoo continues to push forward with what seem to me academic-type initiatives. In terms of traffic and revenue growth, I am waiting for some real action to take place. After writing [...]

  • Kindle Forced in a Nook

    Updated: 2010-04-29 10:59:09
    Short honk: I mentioned at a break in the Boston search conference that I heard the Barnes & Noble Nook was outselling the Amazon Kindle. Several people expressed surprise. I did a quick online check and the factoid appears in “Nook Outsells Kindle in March, E-Reader Sales Expected to Hit 11 Million.” Interesting if spot [...]

  • Small and Mid Sized Businesses: Growth and Search

    Updated: 2010-04-29 10:59:08
    Small Business Computing ran “Microsoft SMB Specialists See 2010 Spending Rise.” Microsoft as azure chip consultants? Why not? The write up contained some quite interesting assertions about the future. Well, it was Microsoft’s business partners who were the source of the survey sample. And, to be fair, the survey did not consider the spill over [...]

  • Open, Closed, and Information Access

    Updated: 2010-04-29 10:59:07
    I continue to hear comments about the importance of open source software. Two or three years ago, knowledge of open source search technologies was confined to some specialist groups. Today open source is generally understood and open source search implementations can be found in large and small organizations. The article “The Tradeoff between Open and [...]

  • Is Apple in the Search Business?

    Updated: 2010-04-29 10:59:05
    I read several posts about Apple’s acquisition of Siri, a maker of software that “understands what you say, accomplishes tasks for you, and adapts to your preferences over time.” The software promises a great deal, but like most smart systems, Siri has some glitches. These may have less to do with technology and more to [...]

  • CI Fellows program renewed

    Updated: 2010-04-29 02:58:08
    Lev Reyzin points out the CI Fellows program is renewed. CI Fellows are essentially NSF funded computer science postdocs for universities and industry research labs. I’ve been lucky and happy to have Lev visit me for a year under last year’s program, so I strongly recommend participating if it suits you. As with [...]

  • I’ll be speaking in Washington, DC on May 6

    Updated: 2010-04-18 15:48:15
    My clients at Aster Data are putting on a sequence of conferences called “Big Data Summit(s)”, and wanted me to keynote one. I agreed to the one in Washington, DC, on May 6, on the condition that I would be allowed to start with the same liberty and privacy themes I started my New England [...]

  • MLcomp: a website for objectively comparing ML algorithms

    Updated: 2010-04-15 05:56:55
    Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We (Percy and Jake) believe there are currently a number of shortcomings: Incomplete [...]

  • Terrier/FIRE

    Updated: 2010-04-11 20:36:52
    Terrier FIRE. FIRE Data: The corpus of the FIRE-2008 and FIRE-2010 adhoc tasks contains Indian-language documents in Bengali, Hindi, and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc. Terrier is a great choice of retrieval system for FIRE. However, it needs careful configuration and a few code changes. Below, we detail the code changes and a recommended configuration. Code Changes: Some changes in the code are necessary for indexing and retrieving FIRE data.

  • Jeff's Search Engine Caffè February 28, 2010

    Updated: 2010-04-08 12:47:34
    Jeff's Search Engine Caffè: Information Retrieval research and search engine development discussion. Thursday, March 4. Semantic Search Competition. Peter Mika highlights the Semantic Search competition at the upcoming Semantic Search 2010 workshop at WWW 2010. From Peter's post: Participants will be given queries sampled from a web search query log provided by the Yahoo Webscope program, and have to try to answer those queries using the Billion Triples Challenge corpus from 2009. The queries that are selected are all entity queries in that they are looking to find information about a single entity. This is an interesting competition because it attempts to use unstructured web queries to do retrieval over a heterogeneous collection of structured data. The Billion Triples collection contains data from DBpedia (extracted from Wikipedia), Geonames, a variety of social networks, and other sources. There's a group of us here working on an entry; we'll see how it goes. Posted by jeff.dalton at 10:40 AM. About Me: Jeff Dalton. I'm a Comp Sci grad student in the PhD program at the CIIR at UMass Amherst.

  • Michael Bendersky CIIR

    Updated: 2010-04-08 12:47:25
    Michael Bendersky. I am currently a 3rd year PhD student at the Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst. I am broadly interested in theoretical information retrieval models and their practical applications. My current research focuses on studying and improving retrieval with verbose natural language queries. This research combines insights from information retrieval, natural language processing and statistical machine learning, and has a potential to revolutionize the way people search on the web, in the enterprise and on mobile devices. Recent Publications: M. Bendersky, E. Gabrilovich, V. Josifovski and D. Metzler: The Anatomy of an Ad: Structured Indexing and Retrieval for Sponsored Search. In Proceedings of WWW 2010. To appear. M. Bendersky, D. Metzler and W. B. Croft: Learning Concept Importance Using a Weighted Dependence Model. In Proceedings of WSDM 2010. Invited Talks: Discovering Key Concepts in Verbose Queries, Technion, Israel Institute of Technology, January 2009. Long Queries and

  • Who Needs Massively Multi-core blog CACM Communications of the ACM

    Updated: 2010-04-08 12:47:17

  • SemSearch2010 Semantic Search 2010 Workshop

    Updated: 2010-04-08 12:47:07
    Semantic Search Workshop, located at the 19th Int. World Wide Web Conference (WWW2010). April 26, 2010 (Workshop Day), Raleigh, NC, USA. Important Dates: Deadline for standard paper submissions: March 6th, 2010, 12.00 AM, GMT. Notification of acceptance, standard papers: March 28th, 2010. Camera-ready versions of standard papers: April 6nd, 2010. Optional deadline for Entity Search system description submissions: April 10th, 2010, 12.00 AM, GMT. Deadline for Entity Search Evaluation results: April 10th, 2010, 12.00 AM, GMT. Notification of acceptance for Entity Search system papers: April 18th, 2010. Camera-ready versions of Entity Search system papers: April 24th, 2010. WWW'10 Conference: April 26th-30th, 2010. Workshop Day: April 26th, 2010. Objectives: In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. However, the representation of user queries and resource content in existing search

  • Geeking with Greg Book review Search User Interfaces

    Updated: 2010-04-08 12:47:05
    Geeking with Greg: Exploring the future of personalized information. Thursday, September 17, 2009. Book review: Search User Interfaces. UC Berkeley Professor Marti Hearst has a great new book out, Search User Interfaces. The book is a survey of recent work in search, but with an unusual focus on the importance of interface design on searcher's perceptions of the quality and usefulness of the search results. Marti writes with the opinionated authority of an expert in the field, usefully pointing at techniques which have shown promise while dismissing others as consistently confusing to users. Her book is a guide to what works and what does not in search, warning of paths that likely lead into the weeds and counseling us toward better opportunities. To see what I mean, here are some extended excerpts. First, on why web search result pages still are so simple and spartan in design: The search results page from Google in 2007 and Infoseek in 1997 are nearly identical. Why is the standard interface so simple? Search is a means towards some other end, rather than a goal in itself. When a person is looking for information, they are usually

  • What Will 2010 Bring blog CACM Communications of the ACM

    Updated: 2010-04-08 12:47:03

  • Jeff's Search Engine Caffè ICWSM 2010 Data Challenge

    Updated: 2010-04-08 12:47:02
    Wednesday, December 9. ICWSM 2010 Data Challenge. The ICWSM is a conference on blogs and social media. For the conference, they issued a data challenge: The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, 27 GB compressed. The deadline is March 1st. Something to look at after the SIGIR deadline. Posted by jeff.dalton at 4:53 PM.

  • Google Research Publication BigTable

    Updated: 2010-04-08 12:47:01
    Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable. Appeared in: OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA

  • Jeff's Search Engine Caffè Google examines synonym effectiveness in query expansion

    Updated: 2010-04-08 12:46:58
    Thursday, January 21. Google examines synonym effectiveness in query expansion. Google has used synonyms for query expansion for several years now. It is part of their attempt to find what you mean, not just what you type. Steven Baker, an engineer on the quality team, wrote a post covering a recent examination of synonym usage in query expansion. He writes: our measurements show that synonyms affect 70 percent of user searches across the more than 100 languages Google supports. We took a set of these queries and analyzed how precise the synonyms were, and were happy with the results: For every 50 queries where synonyms significantly improved the search results, we had only one truly bad synonym. Another tidbit is that Google is expanding their highlighting of synonyms in search result summaries. Lastly, a tip if you get stuck with one of the 1 in 50 queries where synonyms go bad: You can also turn off a synonym for a specific term by adding a “+” before it or by putting the words in quotation marks. Bill Slawski has good coverage of the post and previous work on synonym

  • Jeff's Search Engine Caffè Hadoop Eclipse Tip Lib Dependencies

    Updated: 2010-04-08 12:46:54
    Sunday, December 13. Hadoop Eclipse Tip: Lib Dependencies. I'm writing a Hadoop job and I ran into a little problem that I wanted to share and remind myself of the solution for the future. I am packaging up my Hadoop program into a Jar file. It has external dependencies on text parsers. One way to include these with my program is to package the dependencies inside the jar in a lib directory. This ensures the jar and all dependencies get copied to the Hadoop Mappers. I create my jar file by right-clicking on the project and choosing Export, then Java, then Jar file. I then select my code and the lib directory. However, the problem I had was that my lib directory was not being exported. I learned that this happens if the jars in lib are on your build path. To solve this, the jars need to be external or in a different folder. Then you can export the lib directory as a resource. Anyone care to share a better solution? Posted by jeff.dalton at 4:12 PM.
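The jar layout described in the Hadoop tip above (job classes at the root of the jar, dependency jars under a lib directory) can also be produced outside Eclipse. What follows is only a minimal sketch in Python, relying on the fact that a jar is a zip archive; the function names `build_job_jar` and `list_entries` and every path shown are hypothetical and do not come from the original post.

```python
import zipfile
from pathlib import Path


def build_job_jar(jar_path, classes_dir, dependency_jars):
    """Write a jar (a zip archive) with compiled classes at the root
    and each dependency jar stored under lib/.

    All arguments are illustrative: jar_path is the output file,
    classes_dir is a directory of compiled .class files, and
    dependency_jars is an iterable of paths to third-party jars.
    """
    jar_path = Path(jar_path)
    classes_dir = Path(classes_dir)
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        # Compiled classes keep their package-relative paths.
        for path in sorted(classes_dir.rglob("*")):
            if path.is_file():
                jar.write(path, path.relative_to(classes_dir).as_posix())
        # Dependencies go into lib/ inside the jar, as the post describes.
        for dep in dependency_jars:
            dep = Path(dep)
            jar.write(dep, f"lib/{dep.name}")
    return jar_path


def list_entries(jar_path):
    """Return the sorted entry names of a jar, to check what was packaged."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(jar.namelist())
```

Calling `list_entries` on the result is a quick way to confirm that the lib directory made it into the archive, which is exactly the failure the post ran into with the Eclipse export.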

  • Jeff's Search Engine Caffè September 13, 2009

    Updated: 2010-04-08 12:46:43
    Thursday, September 17. Hadoop 0.20.1 released. Hadoop 0.20.1 is finally here. Get it while it's hot! If you want, you can read the full release notes. This is the release to use if you are setting up a new cluster. It's also worth upgrading older pre-0.20.x clusters to this release. Hadoop 0.20.x is very different from previous releases. The configuration and APIs have been overhauled. As previously mentioned, there is the new TFile storage format. Look for an imminent release of PIG 0.4 and the Cloudera distribution CDH2 (0.20.x) with Hive and PIG support. Posted by jeff.dalton at 11:41 AM. Wednesday, September 16. Yahoo Key Scientific Challenges Coverage III: Web Information Management. Today continues the series (part I: search; part II: machine learning) of Henry's notes from the Yahoo Key Scientific Challenges summit. Today we are covering Brian Cooper's talk on challenges in Web Information Management, which deals with structured data, unstructured data, and making structure out of unstructure. Information extraction. Goal: from

  • Jeff's Search Engine Caffè February 8, 2009

    Updated: 2010-04-08 12:46:43
    Thursday, February 12. WSDM 2009 Best Papers and other Highlights from Matt Lease. See previous WSDM 2009 coverage here and here. Things are a bit crazy at WSDM and, as usual, Internet connectivity is spotty. However, in his exhausted state Matt Lease sent me a few highlights. He's been doing some great research with us here at Amherst and is graduating this summer, in case anyone is interested in that sort of thing. How many CS PhDs does it take to fix a projector? 159: 9 actively, 150 watching. After 30 min they managed to get it working by plugging into the projector, meaning presenters had to do hand signals to get slide changes for the rest of the session; everyone had their own distinct style of coping with the absurdity of the situation. Best Paper Awards. Best Paper: Integration of News Content into Web Results by Fernando Diaz. Congratulations! A recent CIIR alum brings in another best paper award. Best Student Paper: The Web Changes Everything: Understanding the Dynamics of Web Content by Eytan Adar, Jon Elsas, Jaime Teevan, and Susan Dumais. Congratulations, Jon and Eytan! Best

  • Jeff's Search Engine Caffè March 22, 2009

    Updated: 2010-04-08 12:46:38
    Friday, March 27. Statistical Learning of Semantics from Web Data. Greg wrote a post on an article in the April 2009 IEEE Intelligent Systems, The Unreasonable Effectiveness of Data, by Alon Halevy, Peter Norvig, and Fernando Pereira. It's on a similar topic as Peter's CIKM 08 industry day talk, Statistical Learning as the Ultimate Agile Development Tool. In it the Googlers cover statistical learning of semantic interpretations from large quantities of information. They highlight the TextRunner project and Michael Cafarella's related work at UW extracting schema from tables on the web. They also highlight Marius Pasca's work, Organizing and Searching the World Wide Web of Facts, Step Two: Harnessing the Wisdom of the Crowds, which demonstrates extracting entity classes from free web text and large query logs. A few excerpts. First, on leveraging the schemas extracted from the myriad of tables on the web: What we need are methods to infer relationships between column headers or mentions of entities in the world. These inferences may be incorrect at times, but if they’re

  • Jeff's Search Engine Caffè October 18, 2009

    Updated: 2010-04-08 12:46:36
    Friday, October 23. Conferences Coverage: RecSys09 and HCIR09. I'm not attending either, but trying to follow what's going on. The 2009 conference on recommendation systems in NY is happening this weekend. Follow the conference on Twitter, recsys09. I'm particularly looking for coverage on the Netflix Challenge panel: What did we learn from the Netflix Prize? Perspectives from some of the leading contestants. The HCIR Workshop is also taking place in DC. Daniel is one of the chairs. You can also see other coverage on hcir09. The proceedings for the workshop are available. Henry is attending and taking part in a panel, so hopefully I'll be able to share some of his highlights. Posted by jeff.dalton at 11:56 AM. Tuesday, October 20. Why I Don't Want Your Search Log Data. The IR field is largely driven by empirical experiments to validate theory. Today, one of the biggest perceived problems is that academia does not have access to the query and click log data collected by large web search engines. While this data is critical for improving a search

  • Jeff's Search Engine Caffè February 22, 2009

    Updated: 2010-04-08 12:46:36
    Wednesday, February 25. EntityCube and Opinion Organization at MSR TechFest. TechFest is a global gathering of Microsoft researchers from around the world to show off their projects and exchange ideas. The Live Search blog highlights some of the search projects. Some of the highlights include: EntityCube: EntityCube is an entity search and summarization system that efficiently generates summaries of Web entities from billions of crawled Web pages. The summarized information is used to build an object-level search engine about people, locations, and organizations and explore their relationships. It is used in Live Product search to extract names, descriptions, images, and prices. It is also used to create structure for the Libra academic search. Opinion Search, which collects and organizes review data around products and services. It's currently used to create the opinion index in Live product search. Posted by jeff.dalton at 7:35 AM. Monday, February 23. Theory of Information Retrieval Conference CFP. Microsoft Research Cambridge is hosting a

  • Jeff's Search Engine Caffè April 19, 2009

    Updated: 2010-04-08 12:46:34
    Thursday, April 23. NSF Clue Award for Mining Semantic Word Relationships. Google congratulated the projects that were awarded 2009 CLuE grants, which include access to the Google IBM cluster. Our lab received a grant to work on mining word relationships from large corpora. The particular focus is on techniques that create and use Web-based corpora of comparable sentences and text chunks for estimating word and phrase translation probabilities, and on techniques that derive relationships from context vectors that represent word and phrase meanings. Part of the project will also upgrade Trevor's work on TupleFlow to work with Hadoop. Posted by jeff.dalton at 4:52 PM. Wednesday, April 22. WWW 2009 Papers and Workshops. This week WWW 2009 is happening in Madrid. The papers and many presentations are available on eprints. For web search, the AIRWeb Workshop (aka Web Spam) proceedings are also online. Posted by jeff.dalton at 2:32 PM. SIGIR 2009 accepted papers published. The list of accepted papers is now available. Here are a

  • Jeff's Search Engine Caffè Semantic Search Competition

    Updated: 2010-04-08 12:46:25
    Thursday, March 4. Semantic Search Competition. Peter Mika highlights the Semantic Search competition at the upcoming Semantic Search 2010 workshop at WWW 2010. From Peter's post: Participants will be given queries sampled from a web search query log provided by the Yahoo Webscope program, and have to try to answer those queries using the Billion Triples Challenge corpus from 2009. The queries that are selected are all entity queries in that they are looking to find information about a single entity. This is an interesting competition because it attempts to use unstructured web queries to do retrieval over a heterogeneous collection of structured data. The Billion Triples collection contains data from DBpedia (extracted from Wikipedia), Geonames, a variety of social networks, and other sources. There's a group of us here working on an entry; we'll see how it goes. Posted by jeff.dalton at 10:40 AM.

  • Jeff's Search Engine Caffè TREC Entity Track 2009 into 2010

    Updated: 2010-04-08 12:46:20
    Tuesday, March 9. TREC Entity Track 2009 into 2010. Krisztian posted a link to the TREC 2009 Entity Track Overview, part of the TREC 2009 proceedings. The track website has information on the 2009 track and what is planned for 2010. One change they are seeking discussion about is a new semantic entity search subtask: We propose a semantic entity search subtask for 2010: return URIs of related entities, instead of their homepages. We are planning to enrich topics with URIs of the input entities. URIs need to come from a predefined set of semantic data sources, which will include DBPedia and Freebase, at least. The plan is to use the full category A set of ClueWeb09, which has 500 M English web pages, instead of the smaller B subset, which doesn't contain many entity homepages. Posted by jeff.dalton at 2:22 PM.

  • Blogger User Profile Jeff Dalton

    Updated: 2010-04-08 12:46:19
    Jeff Dalton. Gender: Male. Industry: Internet. Occupation: PhD Student. Location: Northampton, MA, United States. About Me: I'm a Comp Sci grad student in the PhD program at the CIIR at UMass Amherst. Before that, I spent four years as a software engineer at Globalspec building vertical search technology. I graduated from Union College with a degree in Comp Sci. My interests include information retrieval, information extraction, and software engineering. In my spare time I like to cook, read historical fiction, and hike in the New England mountains. You can reach me at jeffdalton104-at-hotmail-dot-com or JeffD on Twitter. Interests: Search engines information retrieval google java programming artificial intelligence machine learning aspect-oriented programming text classification web crawling software engineering user interface design information extraction faceted search cooking molecular gastronomy and byzantine history. Favourite Books: Effective Java Programming AI: A Modern Approach How to Cook Everything by John Bittman Mastering the Art of French Cooking Volume 2 Adirondack Trails: High Peak Region. My Blogs: Jeff's Search Engine

  • Jeff's Search Engine Caffè Ranking Real-Time Results Interview with Amit Singhal

    Updated: 2010-04-08 12:46:18
    Thursday, January 14. Ranking Real-Time Results: Interview with Amit Singhal. Yesterday, Technology Review posted an interview with Amit Singhal on How Google Ranks Tweets. According to Amit, one key is to find reputed followers: You earn reputation, and then you give reputation. If lots of people follow you, and then you follow someone--then even though this new person does not have lots of followers, his tweet is deemed valuable because his followers are themselves followed widely. It seems like a pretty straightforward translation of PageRank, with following as a form of link endorsement. In this vein, see also Daniel's TunkRank. The interview goes on to mention the use of geolocation in tweets as a next likely step. Amit also rightly points out that blogs and news organizations are important components of real-time search; it's not just tweets. Posted by jeff.dalton at 6:24 PM. 2 comments. Kat said: Interesting though, I follow several companies and high profile companies on Twitter. As soon as I follow them most of them automatically follow me back. So unless they have
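The "translation of PageRank with following as a form of link endorsement" mentioned in the item above can be illustrated with a toy computation. The sketch below is textbook PageRank run over a follower graph; it is not Google's actual tweet ranking and not TunkRank. The function name `follower_rank`, the damping value 0.85, the iteration count, and the example accounts are all assumptions made for illustration.

```python
def follower_rank(follows, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a follower graph.

    follows maps each user to the set of users they follow. Following
    someone passes an equal share of the follower's own score to them.
    """
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            followed = follows.get(u, ())
            if followed:
                # Split this user's score among the accounts they follow.
                share = damping * rank[u] / len(followed)
                for v in followed:
                    new_rank[v] += share
            else:
                # Users who follow nobody spread their score evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: five fans follow "celebrity", who follows only
# "newcomer". "loner" also has exactly one follower, an ordinary fan.
follows = {
    "fan1": {"celebrity", "loner"},
    "fan2": {"celebrity"},
    "fan3": {"celebrity"},
    "fan4": {"celebrity"},
    "fan5": {"celebrity"},
    "celebrity": {"newcomer"},
}
rank = follower_rank(follows)
```

In this toy graph "newcomer" and "loner" each have a single follower, yet "newcomer" ends up with the higher score because that one follower is itself widely followed, which is the effect the interview describes.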

  • Jeff's Search Engine Caffè March 7, 2010

    Updated: 2010-04-08 12:46:15
    Tuesday, March 9. TREC Entity Track 2009 into 2010. Krisztian posted a link to the TREC 2009 Entity Track Overview, part of the TREC 2009 proceedings. The track website has information on the 2009 track and what is planned for 2010. One change they are seeking discussion about is a new semantic entity search subtask: We propose a semantic entity search subtask for 2010: return URIs of related entities, instead of their homepages. We are planning to enrich topics with URIs of the input entities. URIs need to come from a predefined set of semantic data sources, which will include DBPedia and Freebase, at least. The plan is to use the full category A set of ClueWeb09, which has 500 M English web pages, instead of the smaller B subset, which doesn't contain many entity homepages. Posted by jeff.dalton at 2:22 PM. Data-Intensive Text Processing with MapReduce: Updated Book Draft. An updated draft of the upcoming book, Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, is available. The book isn't finished, but it still has interesting
