Data Science Map.

A work in progress toward organizing and mapping out the toolset I need for my career and various projects. This year's focus has been filling gaps in dataviz and significantly strengthening my data modeling skills. Some work remains around network and dynamic systems. Decision modeling is new to me and my understanding of it still needs to mature.

Overview.

  1. Gathering.
  2. Describing.
  3. Modeling.
  4. Deciding.
  5. Visualizing.

Gathering.

  • Manual: Notebooks, observations, etc
  • Pulling: Database, log files, etc
  • Scraping (i.e. from the web): beautifulsoup, wptools, dbpedia
  • Cleaning: filter, drop, replace, regex (sheets, python, pandas, R)
  • Parsing: filter, sort, pivot, SQL query (sheets, python, pandas, R)
  • Merge: (sheets, python, pandas, R)
  • Store: Notebook, sheets, csv, database, etc
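
As a rough illustration of the cleaning, parsing, merging and storing steps above, here is a minimal sketch in plain Python. It uses only the standard library to stay dependency-free; pandas would be the more typical choice. The sensor data, the lookup table and all function names are hypothetical and exist only for this example.

```python
import csv
import io
import re

# Hypothetical raw export: stray whitespace, a unit suffix, one missing value.
RAW_READINGS = """sensor_id,reading
 A1 ,12.5 mV
B2,
A1,13.1 mV
C3,9.8 mV
"""

# Hypothetical lookup table to merge against.
SENSOR_LOCATIONS = {"A1": "bench", "C3": "field"}


def clean_rows(raw_text):
    """Cleaning: strip whitespace, drop rows with no reading, regex out the number."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        match = re.search(r"-?\d+(?:\.\d+)?", row["reading"] or "")
        if match is None:
            continue  # drop rows with a missing or non-numeric reading
        cleaned.append({"sensor_id": row["sensor_id"].strip(),
                        "reading_mv": float(match.group())})
    return cleaned


def merge_locations(rows, locations):
    """Merge: left join each row against the lookup table on sensor_id."""
    return [dict(row, location=locations.get(row["sensor_id"], "unknown"))
            for row in rows]


def store_csv(rows):
    """Store: serialize to CSV text (a file or database would follow the same shape)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sensor_id", "reading_mv", "location"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = merge_locations(clean_rows(RAW_READINGS), SENSOR_LOCATIONS)

# Parsing: filter to one sensor and sort by reading.
a1_sorted = sorted((r for r in rows if r["sensor_id"] == "A1"),
                   key=lambda r: r["reading_mv"])

csv_text = store_csv(rows)
```

The row for sensor B2 is dropped during cleaning because it has no reading, so three rows reach the merge step.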

Describing.

  • Numerical: simple retrieval, min, max, average, percentage of whole, etc (python, numpy, scipy, R, sheets)
  • Function: class, fit, derivative, integration, limits, monotonicity (python, numpy, scipy, R, sheets)
  • Statistical: mu, sigma, pdf, cdf, confidence, etc (python, numpy, scipy, R, sheets)
  • Dynamic: coefficients, gain, steady-state, overshoot, settling time, etc
  • Network: path, distance, depth, #nodes, #edges, internal composition and organization (networkx, gephi, cytoscape)
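
The numerical and statistical descriptors above can be sketched with Python's standard `statistics` module (numpy or scipy would do the same job). The sample values are made up, and fitting a normal distribution to them is an assumption of this sketch, not something the data guarantee.

```python
import math
import statistics

samples = [9.8, 10.1, 10.0, 10.4, 9.7, 10.2, 9.9, 10.3]  # hypothetical measurements

mu = statistics.mean(samples)
sigma = statistics.stdev(samples)          # sample standard deviation (n - 1)
fitted = statistics.NormalDist(mu, sigma)  # assumes the data are roughly normal

# Approximate 95% confidence interval for the mean using the normal
# approximation; a t-interval would be wider for a sample this small.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * sigma / math.sqrt(len(samples))
ci_95 = (mu - half_width, mu + half_width)

summary = {
    "min": min(samples),
    "max": max(samples),
    "mu": mu,
    "sigma": sigma,
    "pdf_at_mu": fitted.pdf(mu),
    "cdf_at_mu": fitted.cdf(mu),  # 0.5 by symmetry of the normal distribution
    "ci_95": ci_95,
}
```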

Modeling.

  • Numerical: algebraic, geometrical, etc
  • Regression: curve fitting (i.e. function class and coefficients), correlation, error bars
  • Limits: tolerance analysis, sensitivity analysis, threshold analysis
  • Optimization: min, max, rate of change, etc (python, numpy, scipy, R, sheets)
  • Statistical: manufacturing limits, ANOVA, Monte Carlo (time series and aggregate)
  • Dynamic: system diagramming and equations, impulse response, transient response
  • Network: network diagramming and equations, game theory, simulated annealing, search algorithms
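
To make the tolerance, threshold and Monte Carlo items above concrete, here is a minimal sketch that pushes component tolerances through a simple algebraic model. The voltage divider, the 5% uniform resistor tolerance and the 2% spec window are all hypothetical choices for illustration.

```python
import random
import statistics


def divider_output(v_in, r_top, r_bottom):
    """Deterministic algebraic model: output voltage of a resistive divider."""
    return v_in * r_bottom / (r_top + r_bottom)


def monte_carlo_divider(n_trials, v_in=5.0, r_nominal=10_000.0, tolerance=0.05, seed=0):
    """Sample both resistors uniformly within tolerance and collect the outputs."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    low = r_nominal * (1 - tolerance)
    high = r_nominal * (1 + tolerance)
    return [divider_output(v_in, rng.uniform(low, high), rng.uniform(low, high))
            for _ in range(n_trials)]


outputs = monte_carlo_divider(10_000)
nominal = divider_output(5.0, 10_000.0, 10_000.0)

# Tolerance analysis: worst-case bounds with the resistors at opposite extremes.
worst_low = divider_output(5.0, 10_500.0, 9_500.0)
worst_high = divider_output(5.0, 9_500.0, 10_500.0)

# Threshold analysis: fraction of trials inside a hypothetical +/-2% spec window.
spec = 0.02 * nominal
yield_fraction = sum(abs(v - nominal) <= spec for v in outputs) / len(outputs)

mean_output = statistics.mean(outputs)
spread = statistics.stdev(outputs)
```

The worst-case bounds show what can happen; the Monte Carlo spread and yield fraction estimate how often it does under the assumed distribution.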

Deciding.

  • Forecasting/Risk Analysis: regression analysis, Monte Carlo (time series and aggregate), game theory
  • Set Point/Tuning: system inputs sweeping, hypothesis testing, necessary condition (threshold) analysis
  • Decision Matrix, Confusion Matrix, ROC plots, Decision Theory
  • Data Classification, Machine Learning, Deep Learning
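
The confusion matrix and ROC items above are connected: each threshold on a classifier score gives one confusion matrix, and each confusion matrix gives one point on the ROC plot. A minimal sketch, with made-up labels and scores and hypothetical function names:

```python
def confusion_matrix(labels, scores, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            counts["tp"] += 1
        elif predicted_positive:
            counts["fp"] += 1
        elif label:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def roc_point(cm):
    """Return (false positive rate, true positive rate) for one confusion matrix."""
    tpr = cm["tp"] / (cm["tp"] + cm["fn"])
    fpr = cm["fp"] / (cm["fp"] + cm["tn"])
    return fpr, tpr


# Hypothetical ground truth (1 = positive) and classifier scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

cm_half = confusion_matrix(labels, scores, threshold=0.5)

# Sweep the threshold from strict to lenient to trace the ROC curve.
roc_curve = [roc_point(confusion_matrix(labels, scores, t)) for t in (1.01, 0.5, 0.0)]
```

Choosing the set point then becomes a trade-off between the two error types, which is where a decision matrix or cost weighting comes in.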

Visualizing.

  • Time: 1-D amplitude, SPC plots, strip (seaborn, matplotlib, sheets)
  • Relationship (Causal): XY scatter/bubble, pairs plots, line, vector (seaborn, matplotlib, sheets)
  • Matrix (Area): heat [w/ clustering], geographical, tree/other maps (seaborn, matplotlib, sheets)
  • Aggregate (Composition): distribution, columnar [w/ stacking], pie, box, strip (seaborn, matplotlib, sheets)
  • Network: trees (directed), undirected, flow charts (e.g. sankey) (graphviz, pydot, gephi, cytoscape)
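
For the network category above, one low-dependency route is to write the graph as Graphviz DOT text and let graphviz handle the layout. The sketch below builds the text by hand. The `to_dot` helper is hypothetical, and drawing the five stages from the overview as a linear flow is an assumption made for this example.

```python
def to_dot(edges, directed=True, name="G"):
    """Build Graphviz DOT source text from a list of (source, target) pairs."""
    kind, arrow = ("digraph", "->") if directed else ("graph", "--")
    lines = [f"{kind} {name} {{"]
    for source, target in edges:
        lines.append(f'    "{source}" {arrow} "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Assumed linear flow between the stages of this map.
pipeline_edges = [
    ("Gathering", "Describing"),
    ("Describing", "Modeling"),
    ("Modeling", "Deciding"),
    ("Deciding", "Visualizing"),
]

dot_text = to_dot(pipeline_edges)
undirected_text = to_dot(pipeline_edges, directed=False)
```

Saving `dot_text` to a file such as map.dot and running `dot -Tpng map.dot -o map.png` should render it, assuming graphviz is installed.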

Notes.
Things to consider when choosing a model:

  1. Initial Conditions
  2. Boundary Conditions
  3. Mathematical vs Logical
  4. Deterministic vs Stochastic
  5. Linear vs Nonlinear
  6. Static vs Dynamic
  7. Explicit vs Implicit
  8. Discrete vs Continuous
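
Several of these choices can be seen in one small example. The sketch below is a deterministic, linear, dynamic model with an explicit update rule: a continuous first-order system approximated in discrete time with forward Euler steps. The gain, time constant, initial condition and step size are hypothetical values.

```python
def simulate_first_order(gain, tau, x0, u, dt, t_end):
    """Forward-Euler simulation of tau * dx/dt = gain * u - x, starting at x0."""
    steps = round(t_end / dt)
    times, states = [0.0], [x0]
    x = x0
    for k in range(1, steps + 1):
        x += dt * (gain * u - x) / tau  # explicit update from the previous state
        times.append(k * dt)
        states.append(x)
    return times, states


def settling_time(times, states, target, band=0.02):
    """First time after which the response stays within +/-band of the target."""
    tolerance = band * abs(target)
    outside = [i for i, x in enumerate(states) if abs(x - target) > tolerance]
    if not outside:
        return times[0]
    last_outside = outside[-1]
    if last_outside + 1 >= len(times):
        return None  # never settles within the simulated window
    return times[last_outside + 1]


GAIN, TAU = 3.0, 2.0  # hypothetical system parameters
times, states = simulate_first_order(GAIN, TAU, x0=0.0, u=1.0, dt=0.01, t_end=20.0)

steady_state = states[-1]                     # approaches gain * u
t_settle = settling_time(times, states, target=GAIN * 1.0)
```

For a first-order system the 2% settling time is close to four time constants, so a result near 8 seconds is a quick sanity check on the step size.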
