NUWeb System
[email protected]

WWW Architecture
• Web Server (e.g., Apache, IIS)
• Browser (e.g., IE, Firefox)
• Addressing and Information Channel (DNS, URL, Search Engine)
• Abstract Model:
  – Provider (server), Consumer (client), Channel
  – Client-Server architecture, Centralized Service

Problems of the WWW due to the fundamental design
• Naming/Addressing problem:
  – Physical naming/addressing
  – Static binding through DNS
  – URL may not be a good design (hard to remember)
  – DNS could be slow
• Information flow organization was not designed in the first place
  – Hotspot bottleneck problem, bandwidth waste problem
  – Cache and Proxy technologies were added separately afterwards
• Linkrot problem
  – Dead links, wrong links, faked links
  – Affects up to approximately 15% of links
• Need a static IP, need to apply for a URL, need knowledge in building and managing websites
  – Creating and maintaining a website is costly
  – Webpage creation is not easy
• Divides the computer world into two hierarchies
  – Server: website owners, service providers
  – Client: ordinary users

Weaving the Web (quoted from Wikipedia)
• In Berners-Lee's book, Weaving the Web, several recurring themes are apparent:
  – It is just as important to be able to edit the Web as browse it. Wikis are a step in this direction, although Berners-Lee considers them merely a shadow of the WYSIWYG functionality of his first browser.
  – Computers can be used for background tasks that enable humans to work better in groups.
  – Every aspect of the Internet should function as a Web, rather than a hierarchy. Notable current exceptions are the Domain Name System and the domain naming rules managed by ICANN.
  – Computer scientists have a moral responsibility as well as a technical responsibility.

What Is NUWeb?
• Marriage of WWW with P2P
• Technologically:
  – NUWeb = WebServer + Browser + WNS + SearchEngine + Proxy/Cache + WebBuilder + Blog + CommunityEngine + KIM + P2P – URL – DNS and – Cost
• Logically:
  – A new Web system for any net user to build his/her own web in an extremely easy-to-use way.
  – A platform for web-building, information sharing, information management, community, and service management
• A platform for Webilization
• A project to pursue Wemocracy

NUWeb Functions
• A platform for Public Sharing and Publishing
  – Personal website/blog
  – Public community
  – Search Engine
• A platform for Private Sharing and Community
  – Personal community builder
  – Sharing management
• A platform for personal information / knowledge management, content engine

NUWeb Software Architecture
• The NUWeb system is composed of three subsystems
  – NUWeb.CC CyberCenter
    • WNS (web name service)
    • Search engine, Cache
    • Community services (Photo, Blog, Video…)
  – NUWeb CP (Community Portal)
    • Community services (Blog, Photo, Video…)
    • Search Engine service
    • Proxy and Cache
  – NUWeb PP (Personal Portal)
    • NUWeb browser, KIM
    • NUWeb server
    • NUWeb personal portal/blog builder

How it works
• Personal Web server on Windows platform
  – Auto indexing, thumbnail
  – Auto page generation and run-time rendering
  – Auto caching
  – Bundled with php/perl platform
• Registration to WNS during setup
  – Site name, user-account, SiteKey, …
• UPNP to handle firewall/NAT
• Packet-forwarding proxy to handle the cases where UPNP does not work correctly.

How it works (2)
• Each time a client gets online, it sends its current IP and name/key info to the WNS center.
• A connection request to a personal site first sends the name of the site to the WNS to get the IP of the target site (dynamic binding).
• If the requested site is not online, the center redirects the request to the cache server.
• If the site is connected through a proxy, the connection goes through the relay proxy.
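
The registration and lookup flow on the two "How it works" slides can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the NUWeb implementation: the class and method names (WNS, SiteRecord, register, announce, resolve), the liveness timeout used to decide whether a site is "online", and the example addresses are all hypothetical and do not come from the slides.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SiteRecord:
    """What the center remembers about one registered personal site (hypothetical)."""
    site_key: str
    ip: Optional[str] = None   # last announced IP; None until the first announce
    via_proxy: bool = False    # True if the site is reachable only through the relay
    last_seen: float = 0.0     # time of the last successful announce


class WNS:
    """Toy web name service: site name -> current address (dynamic binding)."""

    def __init__(self, cache_server: str, relay_proxy: str, timeout: float = 600.0) -> None:
        self.cache_server = cache_server
        self.relay_proxy = relay_proxy
        self.timeout = timeout  # assumption: a site counts as offline after this many seconds
        self.sites: Dict[str, SiteRecord] = {}

    def register(self, name: str, site_key: str) -> None:
        """Setup-time registration of a site name and its key."""
        if name in self.sites:
            raise ValueError(f"site name already taken: {name}")
        self.sites[name] = SiteRecord(site_key=site_key)

    def announce(self, name: str, site_key: str, ip: str,
                 via_proxy: bool = False, now: Optional[float] = None) -> bool:
        """Called each time a site gets online; rejected if the key does not match."""
        rec = self.sites.get(name)
        if rec is None or rec.site_key != site_key:
            return False
        rec.ip = ip
        rec.via_proxy = via_proxy
        rec.last_seen = time.time() if now is None else now
        return True

    def resolve(self, name: str, now: Optional[float] = None) -> Tuple[str, str]:
        """Return (route, address); route is 'direct', 'relay', or 'cache'."""
        rec = self.sites[name]  # raises KeyError for an unregistered name
        now = time.time() if now is None else now
        online = rec.ip is not None and (now - rec.last_seen) <= self.timeout
        if not online:
            return ("cache", self.cache_server)   # offline: redirect to the cache server
        if rec.via_proxy:
            return ("relay", self.relay_proxy)    # behind NAT: go through the relay proxy
        return ("direct", rec.ip)                 # online with a reachable IP


if __name__ == "__main__":
    wns = WNS(cache_server="198.51.100.1", relay_proxy="198.51.100.2", timeout=60)
    wns.register("alice", "k1")
    wns.announce("alice", "k1", "192.0.2.10")
    print(wns.resolve("alice"))
```

The sketch keeps the name-to-IP binding in the center and refreshes it on every announce, which is what lets a site work without a fixed IP; how the real center decides that a site is offline is not stated on these slides, so the timeout here is a placeholder.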

Naming and Dynamic Addressing
  – A page is a textual web document. It contains UltraLinks or tags, and displaying such a page may trigger the display of other objects such as included images.
  – An object is either a rich-text document such as PDF, MS-DOC, MS-PPT, etc., a multimedia file, or any single file that can be accessed in the web space.
  – A resource is either a page or an object
  – GRN, global resource naming
    • SiteUniqName#objectname[#class#type#location]
  – A fixed IP is not necessary
  – ABN (AddressByName), ABI (AddressById), ABC (AddressByContent)
  – USI (UniversalSiteId)

NUWeb CyberCenter
• GRI: Global Resource Index
  – A distributed index structure for objects/pages in the NUWeb space
  – Uses a hash data structure
• Search engine, Community Service, Portal for NUWeb
• Proxy & Caching
  – Auto backup and versioning
  – Info filtering, content switching
  – Packet forwarding, center relay
  – Relay casting, media streaming
  – Hierarchical search
  – Collaborative cache (super cache)

Site Initialization
• When a new site is installed:
  – Register the following info
    • SiteUniqName, to be interacted by the center
    • Titles of the site (at most T bytes)
    • Abstract of the site (at most P bytes)
    • Tags (inappropriate tags, such as those infringing others' rights, will be removed by the center)
    • Country/city/county, real-world geography info
    • Profile of personal info
    • Residents: SUN.resident will identify a user
  – Decide which directories to open to the public
  – Decide which directories to open to private connections
  – Decide whether to allow caching of the public directory

Site Initialization (2)
• The server will build an index for the pages/objects covered by the site. The indexes for public and private areas are kept separate so that privacy is protected.
• The index is on the name and signature level, plus the content of pages; support for indexing object content, such as MS-DOC and PDF files, is optional.
• After the site is set up, the user will be asked to provide a list of friends to whom the system will send invitation letters.

NUWeb Services
• NUSite, NUBlog
• NUSearch, NUSM
• NUCommunity, NUBBS
• NUBot, NUWatch, NUPush
• NUCache, NUProxy
• NUPedia, knowledge authoring/manager
• NUMail, P2P secure mail system
• NUJournal

Searching
• The search in the nuweb center includes:
  – Search pages/objects by name (WNS)
  – Page content search
  – * Attribute search, for example, search for pages authored by Hamming
• The indexer in each nusite sends the raw-index to the center, and the center builds an index. The raw-index is a record containing indexable texts for each page or object. A text extractor is used to extract text from rich-text documents such as MS-DOC/PPT documents. The raw index is uploaded only after the user approves it.
• Before rendering the search result to the user, the searcher needs to check whether the result page/object exists at that moment.
• It uses the SSN to check the SiteDB and see whether that site is available. It also uses the GRN to check where the resource is available in the cache.

Caching
• Caching
  – Every site page will be automatically cached, unless explicitly disabled
  – In the first phase, caching is done in the center and the NUWeb CP cache spaces. Objects are cached when accessed
    • The client caches the object in its cache spool and sends an index to the center to notify it that the object is in its cache.
  – In the second phase, caching is also done collaboratively in the P2P space, assuming that some of the personal sites are willing to participate.
  – The cache object will be indexed by GRN and MD5
  – Note that if an object is modified, it will trigger an update to the global cache space to remove the original cache entry indexed by GRN
  – Each cache object records a timestamp of the content (the time the content was created).

GRI & Collaborative Proxy
• GRI:
  – Object indexed by MD5 signature & GRN
  – Home page indexed by GRN
  – Instance indexed by MD5
• Syntax:
  – GRN: SUN#OBN
• Distributed/Collaborative GRI
• Multi-tier Collaborative Proxy

Indices (1)
• In the nuweb center, there are several indices:
  – SiteDB: indexed by SSN
    • Last live time, access count, data size
    • When alive, each site periodically sends alive info to the center (every K minutes)
  – NameDB: indexed using gaisindex
    • Each name is associated with an SSN by which we can check whether such a page/object exists.
    • Each name has a record, which holds an SSN value and a GRN cache flag
    • In a NameDB search result, a record that has no online instance (either the original site or a cache copy) carries a flag indicating "not available"

Indices (2)
  – MD5 index: objects/pages indexed by MD5 signature. Each site produces an MD5 signature for each object, and the (grn, md5) info is sent to the center to be indexed. An MD5 lookup returns the source SSN/IP or the cache site's IP
  – Page/document content index
    • Indexed through the gais search engine

NUWeb Portal Service
• Search engine for the NUWeb cyberspace
  – Websites, pages, pictures, videos, documents, articles, etc.
• Browsing and Viewing
  – What's hot, what's new, what's cool
  – Automatically generated through a page rendering tool based on a CountDB and list manager.

NUWeb DB
• NUWeb cache is implemented through the NUWeb DB system.
• NUWeb DB stores Web objects and their relationships and provides search functions.
  – Web DB:
    • ODB (Object DB)
    • NDB (Name DB)
    • IDB (Index DB)
    • TDB (Term DB)
    • UDB (User DB)
    • SDB (Site DB)
    • Page Engine
    • Access Log DB (PV DB)
    • Access Control
    • Query Interface (including SQL) *

Web DB implementation
• ODB and NDB are the kernel storage DBs
• The key technique used in ODB and NDB is the Hash DB, which needs to minimize disk seeks and maximize memory usage.
• PV DB (Access Log DB) is implemented on top of ODB and NDB.
• Term DB is implemented on top of ODB too. Term DB records term frequency, term score, … information.

Web DB implementation (2)
• Site DB records site info such as access frequency, size, dynamics, etc.
• IDB is a real-time index engine for all the objects stored in Web DB.
• Access Control:
  – Authorization: permission-list based
  – Authentication: through an authentication center in the WNS server.
• SQL is not supported yet; it is on the to-do list.

NUDB
• Net User's DataBase
• Easy to use
  – No database background is needed.
  – No need to program
  – Define the spec and start to use
    • Spec can be adjusted flexibly
  – Scalable
• Combines the advantages of table-processing software such as Excel and database systems
• Portable, computable, mergeable

NUDB implementation
• Physical DB Kernel
  – Hash DB
  – Inverted Index
  – Pattern Matching
• Schema Layer, and Query Processing
• User Interface Layer
  – Data Presentation Management
  – DUA (Database User Agent, similar to an MUA)

NUBlog
• AJAX-based Blog System
• Personal Blog Home Base
  – Can have multiple copies on the web
  – Creation, Management, Posting
• Import, Export:
  – XMLRPC
  – Robot, simulating browser behaviour

NUWatch
• Personal Web Agent
• Event Watch, News Watch
• Service Watch
• Site Watch
• Commerce Watch

NUWatch Implementation
• Personal Profile Manager
• Matching Platform
  – On-the-fly matching
  – Batch-mode matching through searching
• Data Source Agents
  – Per-user agent
  – Centralized agent (can reduce overhead)
• Notification Agent
  – Relay casting to speed up
  – Gateway to message system

NUCommunity
• Personal and Regional Community Engine
  – Forum, Vote, Calendar
  – File Sharing, Address Book
  – DB, ..
  – Interaction mechanism (auto notification, ..)
• A community is conceptually given a NUWeb site
• A community is treated like a user in the NUWeb space's authentication and authorization

Access Control
• Supports both password-based and membership-based protection.
• Each directory is associated with a protection data structure
• Authentication in the WNS server
• Uses the Permission List technique for membership-based protection
• Protection is per directory; no inheritance is assumed.

NUJournal
• Why is publication done through paper?!
  – Traditionally, publications HAD TO BE published on paper
  – A journal is both a channel and a barrier
  – Most papers enter a dead state once published
• A new model of publication
  – Separate the concepts of publication and evaluation
  – Publication is an autonomous decision, and a work can be published through one's own website, then reviewed and commented on by readers or reviewers.
  – A journal is a marketplace that gathers and guides access to publications and comments on and evaluates them
  – A publication can be a long-lived object
  – Other authors can join a published work over time if they make substantial contributions to it.
  – A publication is evaluated by its contribution and impact.

Thanks!