"Be liberal in what you require but conservative in what you do" from http://www.w3.org/DesignIssues/Principles.html "the map is not the territory" - https://en.wikipedia.org/wiki/On_Exactitude_in_Science - https://en.wikipedia.org/wiki/Simulacra_and_Simulation an ontological assimilator takes arbitrary web pages and transforms them into an abstract data structure, including references to binary data blobs, functions, and user interface elements - reference implemetation: - steamroller (the assimilator) - steamtrain (ui to tell it what's what, a chrome plugin) - anorak (inspects contributions from steamtrain) - dddozer (backend to do heavy processing of raw data i.e. image processing, OCR, object recognition) LCARS then renders the distilled data structure as a 2d interface suitable for viewing and interacting on a touchscreen - other interfaces - ncurses - command line (what happened to surfraw?) - stdin/stdout/stderr - fuse www filesystem (lazy download) - emacs - spoken - 3d AR/VR - failsafe: on error in classification or parsing, dump the original HTML as it was formatted bayesian classifier - classifies html subtrees as content or user interface elements - can be trained with feedback from user or administrator or webmaster useless noise data is junked with extreme prejudice - junk data is still accessible but not prevalently displayed or executed unless asked - tracking beacons, like buttons, ads, third party services - music players, sparkling unicorn mouse cursors - the specific color theme and fonts the webmaster used - the specific music player, video player, lightbox, file browser, button style all data can be manipulated with the user generic interface functions - zoom - save as - classify - comment on - clone - transclude (clone with updates from upstream) - serialize - as JSON - as YAML - as RDF - as HTML - find neighbors - similar content - data that links here - data linked from here - data that should have been linked to/from here based on a cross 
    reference index
  - wikipedia overview/category pages
  - known familial or personal relationships
  - data with matching metadata

obviously there is no end to this

certain types of data have specialized functions
- table:
  - sort by header
  - sum column, average, plot as...
  - symbolic regression
- units:
  - show SI unit
  - show original representation
  - visualize as objects of equivalent unit
- address: (email, web, street)
  - get, put, bookmark
  - geolocate
- chronological: (blog, facebook wall, twitter stream)
  - plot timeline
  - correlate to time series

text in images is automatically OCRed and overlaid on thumbnails as dynamically generated callouts
- project naptha http://projectnaptha.com
- callouts respect page text flow
- this is more of an interface thing isn't it?

graphs are estimated and transformed to data tables

equations are parsed into functional code like mathematica
- sage integration? frontend problem. but what data format does it want? TeX is for typesetting

webpages are locally cached as their template and the data that populates the template
- respect do not cache flag?
  - copyright lawsuits, cease/desist
- old versions remain accessible

cache server
- tahoeFS? DHT?
  (out of my depth here)
- stores webpages as templates and data
- clients can download from cache server instead of original source
  - faster, only data is downloaded, not web template
  - less work to assimilate page
  - closest cache server is chosen, less overall network load (like CDN)
- stores OCRed image text
- stores image thumbnails
- cached results may be randomly withheld to prompt client to do some work, need multiple random clients to verify that it's good data (like how reCAPTCHA works)
  - human classification might not be such a terrible thing to do
  - mechanical turk to bootstrap cache and use as training data
- cache server may notify client of duplicate pages, data, or content
  - plagiarism detector
  - find chronological original
  - find higher resolution copy
  - dereference quotes and memes to some degree

allies:
- d3
- css
- rss
- ranger
- tracker
- nepomuk?
- freebase
- bayes
- users

enemies:
- flash
- DRM
- java applets
- lawyers
- captcha

neutral:
- web developers
- content producers
- spambots

semantic data
- what is this, really

the semantic web
- nobody uses it, why?

microformats

ontological warfare
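
sketch: bayesian classifier for html subtrees

a minimal sketch of the "classifies html subtrees as content or user interface elements" idea, in python with only the standard library. everything named here (SubtreeFeatures, features, NaiveBayes, the "content"/"ui" labels, and the choice of tag names, class names and words as features) is an assumption for illustration, not steamroller's actual design. training calls stand in for feedback from a user, administrator or webmaster.

```python
import math
from collections import Counter, defaultdict
from html.parser import HTMLParser


class SubtreeFeatures(HTMLParser):
    """Collects a bag of features (tags, class names, words) from one html fragment."""

    def __init__(self):
        super().__init__()
        self.features = []

    def handle_starttag(self, tag, attrs):
        self.features.append("tag:" + tag)
        for name, value in attrs:
            if name == "class" and value:
                self.features.extend("class:" + c for c in value.split())

    def handle_data(self, data):
        self.features.extend("word:" + w.lower() for w in data.split())


def features(fragment):
    """Return the feature list for an html fragment (hypothetical helper)."""
    parser = SubtreeFeatures()
    parser.feed(fragment)
    parser.close()
    return parser.features


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over subtree features."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, fragment, label):
        # one labelled example, e.g. a correction sent back by a user
        self.label_counts[label] += 1
        for f in features(fragment):
            self.feature_counts[label][f] += 1
            self.vocabulary.add(f)

    def classify(self, fragment):
        total = sum(self.label_counts.values())
        feats = features(fragment)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            # +1 reserves one slot for features never seen in training
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary) + 1
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train('<div class="post"><p>steam engines were invented long ago</p></div>', "content")
nb.train('<article><p>the map is not the territory</p></article>', "content")
nb.train('<nav class="menu"><a href="/">home</a><a href="/about">about</a></nav>', "ui")
nb.train('<div class="share"><button>like</button><button>share</button></div>', "ui")
print(nb.classify('<nav class="menu"><a href="/x">archive</a></nav>'))
```

with these four toy fragments the nav menu should come out as ui. a real classifier would need far more training data and probably structural features (depth, link density, position on the page) as well.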
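
sketch: withholding cached results to get them verified

a rough sketch of the "cached results may be randomly withheld ... need multiple random clients to verify" idea from the cache server notes, again in plain python. the names (VerifyingCache, digest, quorum, withhold_rate) and the rule "accept a result once quorum distinct clients submit identical data" are assumptions made for illustration; the notes do not say how agreement should be measured.

```python
import hashlib
import random
from collections import Counter, defaultdict


def digest(result):
    """Fingerprint a serialized result so submissions can be compared cheaply."""
    return hashlib.sha256(result.encode("utf-8")).hexdigest()


class VerifyingCache:
    """Toy cache that trusts a result only after several clients agree on it."""

    def __init__(self, quorum=3, withhold_rate=0.1, rng=None):
        self.quorum = quorum                  # distinct clients that must agree
        self.withhold_rate = withhold_rate    # chance of hiding a verified result
        self.rng = rng if rng is not None else random.Random()
        self.verified = {}                    # url -> accepted result
        self.submissions = defaultdict(dict)  # url -> {client_id: result}

    def fetch(self, url):
        """Return the cached result, or None when the client should do the work itself."""
        if url not in self.verified:
            return None
        if self.rng.random() < self.withhold_rate:
            return None  # withheld on purpose to collect a fresh independent result
        return self.verified[url]

    def submit(self, url, client_id, result):
        """Record one client's result; return True once the url has a verified result."""
        by_client = self.submissions[url]
        by_client[client_id] = result  # a client resubmitting only replaces its own vote
        votes = Counter(digest(r) for r in by_client.values())
        winner, count = votes.most_common(1)[0]
        if count >= self.quorum:
            self.verified[url] = next(r for r in by_client.values() if digest(r) == winner)
        return url in self.verified


cache = VerifyingCache(quorum=2, withhold_rate=0.0)
url = "http://example.com/"
cache.submit(url, "client-a", '{"title": "x"}')
cache.submit(url, "client-b", '{"title": "spam"}')
cache.submit(url, "client-c", '{"title": "x"}')
print(cache.fetch(url))
```

exact-match voting only works if assimilation is deterministic; if different clients can legitimately produce slightly different structures, some fuzzier comparison would be needed, and that is left open here.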