"Be liberal in what you require but conservative in what you do"
from http://www.w3.org/DesignIssues/Principles.html

"the map is not the territory"
- https://en.wikipedia.org/wiki/On_Exactitude_in_Science
- https://en.wikipedia.org/wiki/Simulacra_and_Simulation

an ontological assimilator takes arbitrary web pages and transforms them into an abstract data structure, including references to binary data blobs, functions, and user interface elements
- reference implementation:
  - steamroller (the assimilator)
  - steamtrain (ui to tell it what's what, a chrome plugin)
  - anorak (inspects contributions from steamtrain)
  - dddozer (backend to do heavy processing of raw data, e.g. image processing, OCR, object recognition)

LCARS then renders the distilled data structure as a 2d interface suitable for viewing and interaction on a touchscreen
- other interfaces
  - ncurses
  - command line (what happened to surfraw?)
    - stdin/stdout/stderr
    - fuse www filesystem (lazy download)
  - emacs
  - spoken
  - 3d AR/VR
- failsafe: on error in classification or parsing, dump the original HTML as it was formatted
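
a minimal sketch of the failsafe, assuming the assimilator and the renderer are plain callables (the names `view`, `assimilate`, and `render` are hypothetical, not part of steamroller or LCARS): any failure in classification or parsing hands back the original HTML untouched.

```python
def view(html, assimilate, render):
    """Render the assimilated page, or fall back to the raw HTML on any failure.

    `assimilate` and `render` are hypothetical callables standing in for the
    assimilator and an interface such as LCARS.
    """
    try:
        return render(assimilate(html))
    except Exception:
        # failsafe: return the original markup exactly as it was formatted
        return html


def broken_assimilator(html):
    raise ValueError("could not classify page")


page = "<p>hello</p>"
fallback = view(page, broken_assimilator, str)
rendered = view(page, lambda h: {"text": "hello"}, lambda d: d["text"])
```

catching every exception is deliberately blunt here; a real interface would probably also want to log what went wrong so the classifier can be retrained.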

bayesian classifier
- classifies html subtrees as content or user interface elements
- can be trained with feedback from user or administrator or webmaster
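
one way the classifier could look, as a sketch only: a multinomial naive Bayes with add-one smoothing over feature tokens pulled from each subtree. the feature set (tag names as tokens) and the class name `SubtreeClassifier` are assumptions for illustration; the notes do not say which features are used.

```python
import math
from collections import Counter, defaultdict


class SubtreeClassifier:
    """Multinomial naive Bayes over subtree tokens (hypothetical feature set)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> training examples seen
        self.vocabulary = set()

    def train(self, tokens, label):
        """Feedback from a user, administrator, or webmaster ends up here."""
        self.label_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the most probable label, or None if nothing was trained."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            for tok in tokens:
                # add-one (Laplace) smoothing so unseen tokens do not zero out a label
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = SubtreeClassifier()
clf.train(["p", "article", "text"], "content")
clf.train(["p", "h1", "text"], "content")
clf.train(["nav", "button", "a"], "ui")
clf.train(["form", "input", "button"], "ui")
```

with this toy training set, `clf.classify(["button", "input"])` comes out as `"ui"` and `clf.classify(["p", "text"])` as `"content"`.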

useless noise data is junked with extreme prejudice
- junk data is still accessible but not prominently displayed or executed unless asked
- tracking beacons, like buttons, ads, third party services
- music players, sparkling unicorn mouse cursors
- the specific color theme and fonts the webmaster used
- the specific music player, video player, lightbox, file browser, button style

all data can be manipulated with the generic user interface functions
- zoom
- save as
- classify
- comment on
- clone
- transclude (clone with updates from upstream)
- serialize
  - as JSON
  - as YAML
  - as RDF
  - as HTML
- find neighbors 
  - similar content
  - data that links here
  - data linked from here
  - data that should have been linked to/from here based on a cross reference index
    - wikipedia overview/category pages
    - known familial or personal relationships
  - data with matching metadata

obviously there is no end to this

certain types of data have specialized functions
- table:
  - sort by header
  - sum column, average, plot as...
  - symbolic regression
- units:
  - show SI unit
  - show original representation
  - visualize as objects of equivalent unit
- address: (email, web, street)
  - get, put, bookmark
  - geolocate
- chronological: (blog, facebook wall, twitter stream)
  - plot timeline
  - correlate to time series
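
a sketch of the simplest specialized functions, assuming a table arrives as a list of rows keyed by header and a unit as a (value, unit) pair. the helper names and the tiny `TO_SI` conversion table are made up for illustration; only the conversion factors themselves are standard.

```python
from statistics import mean


def sort_by(rows, header, reverse=False):
    """Sort table rows by the value under one header."""
    return sorted(rows, key=lambda r: r[header], reverse=reverse)


def column_sum(rows, header):
    return sum(r[header] for r in rows)


def column_average(rows, header):
    return mean([r[header] for r in rows])


# hypothetical conversion table: unit -> (SI unit, multiplicative factor)
TO_SI = {
    "km": ("m", 1000.0),
    "mi": ("m", 1609.344),
    "ft": ("m", 0.3048),
    "lb": ("kg", 0.45359237),
}


def to_si(value, unit):
    """Show the SI equivalent; the caller keeps the original representation."""
    si_unit, factor = TO_SI[unit]
    return value * factor, si_unit


rows = [{"city": "B", "pop": 3}, {"city": "A", "pop": 1}]
```

for example, `sort_by(rows, "city")` puts city A first and `to_si(2, "km")` gives `(2000.0, "m")`.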

text in images is automatically OCRed and overlaid on thumbnails as dynamically generated callouts 
- project naptha http://projectnaptha.com
- callouts respect page text flow
  - this is more of an interface thing isn't it?

graphs are estimated and transformed to data tables

equations are parsed into functional code like mathematica
- sage integration? frontend problem. but what data format does it want? TeX is for typesetting

webpages are locally cached as their template and the data that populates the template
- respect do not cache flag?
  - copyright lawsuits, cease/desist
  - old versions remain accessible
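
to make the template/data split concrete, a small sketch using `string.Template` from the standard library: the template is stored once and each page keeps only the data that fills it. the cache layout and the names `store` and `render` are assumptions, and the do-not-cache question above is not handled.

```python
from string import Template

template_cache = {}  # template id -> Template, stored once per site layout
page_cache = {}      # url -> (template id, data that populates the template)


def store(url, template_id, template_text, data):
    """Cache a page as a reference to its template plus its own data."""
    template_cache.setdefault(template_id, Template(template_text))
    page_cache[url] = (template_id, dict(data))


def render(url):
    """Rebuild the page from the cached template and data."""
    template_id, data = page_cache[url]
    return template_cache[template_id].substitute(data)


layout = "<h1>$title</h1><p>$body</p>"
store("http://example.org/a", "post", layout, {"title": "A", "body": "first"})
store("http://example.org/b", "post", layout, {"title": "B", "body": "second"})
```

both example pages share one cached template, so only the two small data dicts differ between them.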

cache server
- tahoeFS? DHT? (out of my depth here)
- stores webpages as templates and data
  - clients can download from cache server instead of original source
    - faster, only data is downloaded, not web template
    - less work to assimilate page
    - closest cache server is chosen, less overall network load (like CDN)
- stores OCRed image text
- stores image thumbnails
- cached results may be randomly withheld to prompt the client to do some work; multiple random clients are needed to verify that the data is good (like how reCAPTCHA works)
  - human classification might not be such a terrible thing to do
    - mechanical turk to bootstrap cache and use as training data
- cache server may notify client of duplicate pages, data, or content
  - plagiarism detector
  - find chronological original
    - find higher resolution copy
  - dereference quotes and memes to some degree

allies:
- d3
- css
- rss
- ranger
- tracker
- nepomuk?
- freebase
- bayes
- users

enemies:
- flash
- DRM
- java applets
- lawyers
- captcha

neutral:
- web developers
- content producers
- spambots

semantic data
- what is this, really

the semantic web
- nobody uses it, why?

microformats

ontological warfare