post

Web Scraping and Data Mining Course Information Session

If you want to learn how to scrape and mine insights from data on the web, regardless of the level of technical knowledge you have, then our Web Scraping and Data Mining course is for you.

Join us on December 8th, 2015 at 1 PM (New York) / 10 AM (San Francisco) for a free information session on the course.

In this session you’ll:

  • Get more details on what you’ll learn during the course
  • Meet the instructor, Robert Dempsey
  • Find out about available payment options

In addition, you’ll be able to ask any questions you have.

post

Socially Responsible Algorithms for Data Science

For our November Data Science DC, we’re thrilled to have two speakers talk about machine learning algorithms and how they can be socially responsible, or not. Mike Williams from Fast Forward Labs in NYC will be talking about how supervised learning algorithms can amplify existing inequality, and Prof. Lisa Singh from Georgetown will present her research on the digital trails we leave on the web. Much of the Big Data revolution is our new ability to analyze extensive, detailed data generated by people and their behavior. These presentations will give you insights into how to use this data more responsibly.

NOTE: We’ll be at Pew Research this month! Thanks to them for hosting!

Agenda

  • 6:30pm — Networking, Empanadas, and Refreshments
  • 7:00pm — Introduction, Announcements
  • 7:15pm — Presentations and Discussion
  • 8:30pm — Data Drinks (TBA)

Presentations

Mike Williams

This talk will use the example of sentiment analysis to show that supervised machine learning has the potential to amplify the voices of the most privileged people in society. A sentiment analysis algorithm is considered ‘table stakes’ for any serious text analytics platform in social media, finance, or security. As an example of supervised machine learning, I’ll show how these systems are trained. But I’ll also show that they have the unavoidable property that they are better at spotting unsubtle expressions of extreme emotion. Such crude expressions are used by a particularly privileged group of authors: men. In this way, brands that depend on sentiment analysis to ‘learn what people think’ inevitably pay more attention to men. But the problem doesn’t stop with sentiment analysis: at every step of any model building process, we make choices that can introduce bias, enhance privilege, or break the law! I’ll review these pitfalls, talk about how you can recognise them in your own work, and touch on some new academic work that aims to mitigate these harms.
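As a rough illustration of what training a supervised sentiment classifier involves, here is a minimal sketch of a bag-of-words Naive Bayes model in plain Python. The training sentences, labels, and function names are invented for illustration and are not taken from the talk; a real system would learn from far more labelled text.

```python
import math
from collections import Counter

# Hypothetical toy training set (an assumption, not data from the talk).
TRAIN = [
    ("i love this phone it is awesome", "pos"),
    ("great service love it", "pos"),
    ("i hate this phone it is terrible", "neg"),
    ("awful service hate it", "neg"),
]


def train(examples):
    """Count word frequencies per label (multinomial Naive Bayes)."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, label_counts, vocab


def log_odds(text, model):
    """Return log P(pos | text) - log P(neg | text); above zero means 'pos'."""
    word_counts, label_counts, vocab = model
    total_pos = sum(word_counts["pos"].values())
    total_neg = sum(word_counts["neg"].values())
    score = math.log(label_counts["pos"] / label_counts["neg"])
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training carry no signal
        # Add-one smoothing so unseen (word, label) pairs get a small probability.
        p_pos = (word_counts["pos"][word] + 1) / (total_pos + len(vocab))
        p_neg = (word_counts["neg"][word] + 1) / (total_neg + len(vocab))
        score += math.log(p_pos / p_neg)
    return score


model = train(TRAIN)
blunt = log_odds("i hate it", model)
subtle = log_odds("not quite what i was hoping for", model)
```

In this toy setup the blunt sentence gets a clearly negative score, while the mildly worded complaint scores zero because none of its distinguishing words appear in the training data. That is only a small-scale hint at the effect the abstract describes, not a demonstration of it.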

Mike Williams is a research engineer at Fast Forward Labs, which helps organizations accelerate their data science and machine intelligence capabilities by profiling near future technologies from academia and elsewhere, producing reports on their development and prototypes demonstrating their application. He has a PhD in astrophysics from Oxford, and did postdocs at the Max Planck Institute in Munich and at Columbia University. Follow Mike on Twitter @williamsmj_.

Lisa Singh

Helping Users Understand Their Web Footprint

With the growth of online social networks and social media sites, the increase in dynamic web content, and the popularity of digital communication, more and more public information about individuals is available on the Internet. This talk will present a novel information exposure detection framework that generates and analyzes the web footprints users leave across the social web. Our approach uses probabilistic operators, free text attribute extraction, and a population-based inference engine to generate the web footprints. Using a web footprint, the framework then quantifies a user’s level of information exposure relative to others and makes suggestions on which attributes to remove or hide. After presenting the framework, I will show an evaluation that quantifies information exposure on public profiles from Google+, LinkedIn, FourSquare, and Twitter. Finally, the talk will conclude with a brief discussion about data privacy and ethics.

Dr. Lisa Singh is the Graduate Director and an Associate Professor in the Department of Computer Science at Georgetown University. Her research interests are in data science, data mining, and databases. She currently has funding from NSF to study privacy on the web (adversarial inference), dolphin social structures with the Shark Bay Dolphin Research project (graph databases, visual analytics and social mining) and forced migration with the Institute for the Study of International Migration, LLNL, York University and others (text mining, graph mining, event detection). Dr. Singh received her B.S.E. degree from Duke University and her M.S. and Ph.D. degrees from Northwestern University. She has served on many organizing and program committees, including KDD, ICDM, ICDE, and SIGMOD, and is currently involved in different organizations related to women in computing and computational thinking education for K-12.

post

Data Drinks: National Data Community Happy Hour!

Data science Meetup organizers and other data science community members from all around the country will be in town for a conference on November 5th and 6th.

Please join Data Community DC in welcoming them as we host a special Data Drinks Happy Hour at Rock Bottom in Ballston on November 5th from 6pm – 8pm. Come welcome them to our data community, get to know them, and learn about all the interesting things they are doing in their respective data communities nation-wide!

Important: Space is limited, so please only RSVP if you are actually attending.

post

Predictive Models in Python

If you’ve been reading books and blog posts on machine learning and predictive analytics and are still left wondering how to create a predictive model in Python and apply it to your own data, this presentation will give you the steps and code you need to do just that.

You’ll learn how to go from raw data to a trained predictive model you can implement in a production system, and then how to implement it in production.
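The talk's own code is not reproduced in this announcement. As a minimal sketch of the raw-data-to-trained-model path, the example below parses a small CSV, fits a one-feature logistic regression by gradient descent, and exposes a `predict` function of the kind a production service might call. The dataset, column names, and function names are all hypothetical, and in practice a library such as scikit-learn would usually replace the hand-written training loop.

```python
import csv
import io
import math

# Hypothetical raw data: weekly hours of product use and whether the customer renewed.
RAW = """hours,renewed
1,0
2,0
3,0
4,1
5,1
6,1
"""


def load(raw):
    """Parse CSV text into a feature list and a label list."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    xs = [float(r["hours"]) for r in rows]
    ys = [int(r["renewed"]) for r in rows]
    return xs, ys


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lr=0.1, epochs=5000):
    """Fit one-feature logistic regression by batch gradient descent."""
    mean = sum(xs) / len(xs)  # centre the feature so the intercept is easy to learn
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * (x - mean) + b) - y
            grad_w += err * (x - mean)
            grad_b += err
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return {"w": w, "b": b, "mean": mean}


def predict(model, x):
    """Return the estimated probability of the positive class for one input."""
    return sigmoid(model["w"] * (x - model["mean"]) + model["b"])


xs, ys = load(RAW)
model = fit(xs, ys)
low, high = predict(model, 1.0), predict(model, 6.0)
```

The returned `model` is a plain dictionary, so it could be saved as JSON and loaded by whatever service needs to make predictions.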

Speaker

Robert Dempsey is a tested leader and technology professional who delivers solutions and products that solve tough business challenges. His experience forming and leading agile teams, combined with more than 15 years of technology experience, enables him to solve complex problems while always keeping the bottom line in mind.

He’s founded and built three startups in tech and marketing, developed and sold online applications, consulted to Fortune 500 and Inc. 500 companies, and spoken nationally and internationally on software development and agile project management.

In addition, he’s the author of the soon-to-be-released “Python Business Intelligence Cookbook”.

post

Creating Your First Predictive Model in Python

If you’ve been reading books and blog posts on machine learning and predictive analytics and are still left wondering how to create a predictive model in Python and apply it to your own data, this presentation will give you the steps and code you need to do just that.

You’ll learn how to go from raw data to a trained predictive model you can implement in a production system, and then how to implement it in production.

Biography: Robert Dempsey

Robert Dempsey is a tested leader and technology professional who delivers solutions and products that solve tough business challenges. His experience forming and leading agile teams, combined with more than 15 years of technology experience, enables him to solve complex problems while always keeping the bottom line in mind. He’s founded and built three startups in tech and marketing, developed and sold online applications, consulted to Fortune 500 and Inc. 500 companies, and spoken nationally and internationally on software development and agile project management. In addition, he’s the author of the soon-to-be-released “Python Business Intelligence Cookbook”.

Agenda

  • 6:30 – Drinks/Apps and Networking
  • 7:00 – Introduction
  • 7:15 – Talks
  • 8:30 – Wrap up/shut down

Getting To The Meetup

AddThis HQ is located next to the Silver Line’s Spring Hill Metro station. Free parking is also available. If you have trouble getting in, please call Brad at 571.278.5205.

Food

AddThis will provide beer, sodas, and food. There are plenty of local bars nearby if people would like to continue the discussion after the talk.

post

How to Choose a Data Science Tool

On November 16th, 2015 at 1 PM (New York) / 10 AM (California), Robert Dempsey, author of the Python Business Intelligence Cookbook, will take you through the steps for choosing a data science tool.

In this webinar you’ll discover:

  • The four phases of selecting the tool that’s right for you and your team
  • 10 key points to consider before you start your evaluation
  • Tips on how to perform your research so you don’t waste your time during the evaluation phase
  • How best to structure your time during the evaluation to keep productivity high and have the time you need to really test the tools

Takeaways

By attending the webinar you’ll receive:

  1. A recording of the webinar
  2. A one-page checklist to use during your evaluation
  3. A presentation template you can use to help “sell” your tool of choice to management

Register Today

Reserve your seat at the webinar now >>

post

Easy Data Wrangling with DSS: From Scraping HTML To Unsupervised Learning in 1h

Dataiku’s Data Science Studio (DSS) makes data wrangling easy. During this talk, Henri will demonstrate how we can use DSS’s powerful tools to create a complete workflow from raw data to training models in 1h.

We will start by scraping data science related job listings in Washington DC. Then we will download all of the company reviews and try to work out the best place to work by cleaning and parsing the raw HTML, and ultimately by performing unsupervised learning to see what topics come up!

Finally, we will use DSS’s insight tool to create a web app using Flask, HTML, and JavaScript to explore the results.
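For readers who want a feel for the cleaning and parsing step outside DSS, here is a small sketch using only Python's standard library. The HTML snippet, the `job` class name, and the stopword list are made up for illustration, and a plain word count stands in for the unsupervised learning step, which the talk covers with DSS's own tools.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical markup; real job boards use their own structure and class names.
PAGE = """
<ul>
  <li class="job"><h2>Data Scientist</h2><p>Python and machine learning</p></li>
  <li class="job"><h2>Data Engineer</h2><p>Python and SQL pipelines</p></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collect the text fragments found inside each <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._in_job = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "job") in attrs:
            self._in_job = True
            self.jobs.append([])

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_job = False

    def handle_data(self, data):
        if self._in_job and data.strip():
            self.jobs[-1].append(data.strip())


parser = JobParser()
parser.feed(PAGE)

# One lower-cased string per listing, ready for further text processing.
listings = [" ".join(parts).lower() for parts in parser.jobs]

# Crude stand-in for topic discovery: count terms across all listings.
STOPWORDS = {"and"}
term_counts = Counter(
    word for text in listings for word in text.split() if word not in STOPWORDS
)
```

From here, `listings` could be handed to a clustering or topic-modelling routine, and `term_counts` gives a first look at which terms recur across postings.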

Our Speaker: Henri Dwyer

Henri Dwyer is a data scientist and engineer working at Dataiku on building the best platform for data scientists. He received an MSc in Engineering from Columbia University in New York City, and a BS and an MS in Engineering from Ecole Polytechnique in Paris. He now lives in Brooklyn, and is always keen on discovering new data science problems to solve.

post

Analyzing Semi-Structured Data At Volume In The Cloud

Cloud, mobile, and web applications are producing semi-structured data at an unprecedented rate. IT professionals continue to struggle to capture, transform, and analyze these complex data structures alongside traditional relational-style datasets using conventional MPP and/or Hadoop infrastructures. Public cloud infrastructures such as Amazon and Azure provide almost unlimited resources and scalability to handle both structured and semi-structured data (XML, JSON, AVRO) at petabyte scale. These new capabilities, coupled with traditional data management access methods such as SQL, give organizations and businesses new opportunities to leverage analytics at an unprecedented scale while greatly simplifying data pipeline architectures and providing an alternative to the “data lake”.

Please join DWDC and Snowflake Computing for a discussion of these topics and a demonstration of this game-changing technology. The demonstration will focus on analyzing structured and semi-structured data together, using a commercially available cloud-based platform and standards-based SQL to provide insights on petabyte-scale data sets.
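To make the idea of analyzing structured and semi-structured data together with SQL a little more concrete, here is a tiny local sketch. It uses Python's built-in `sqlite3` module as a stand-in for a cloud data warehouse and flattens the JSON before loading it, rather than querying the JSON in place. It is not Snowflake's syntax or architecture, and the tables and records are invented for illustration.

```python
import json
import sqlite3

# Hypothetical data: a relational customer table plus JSON click events.
CUSTOMERS = [(1, "Acme"), (2, "Globex")]
EVENTS = [
    '{"customer_id": 1, "event": "view", "device": {"os": "ios"}}',
    '{"customer_id": 1, "event": "purchase", "device": {"os": "ios"}}',
    '{"customer_id": 2, "event": "view"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE events (customer_id INTEGER, event TEXT, os TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", CUSTOMERS)

# Flatten each JSON document; a missing nested field is stored as NULL.
for raw in EVENTS:
    doc = json.loads(raw)
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (doc["customer_id"], doc["event"], doc.get("device", {}).get("os")),
    )

# Join the flattened events with the relational table using ordinary SQL.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS n_events
    FROM events AS e
    JOIN customers AS c ON c.id = e.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

Here `rows` holds one `(name, n_events)` pair per customer. The flattening loop is the kind of pipeline step that the platforms discussed in this session aim to simplify.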

Our Speaker: Kevin Bair

Kevin Bair is a Solution Architect with extensive experience working with both federal and large commercial organizations over the last 25 years. He has a background in application development, database and content management, virtualization, and operational analytics. His career includes 15 years working for IBM Software Group, ITIL certification, and development of a patent related to Big Data on a virtualized network. Kevin is currently a Solution Architect at Snowflake Computing helping clients and business partners develop enterprise class solutions on AWS using Snowflake’s Cloud-based Elastic Data Warehouse.