Showing posts with label mining. Show all posts

Friday, March 23, 2012

Link analysis

Does Microsoft plan to extend the set of data mining algorithms in AS in future releases? The question is motivated by the task of so-called "link analysis", where one should determine how data attributes are related to each other, or how and to what extent they influence each other in probabilistic terms. A good solution would be to build a Bayesian network, which gives insight into how data attributes are related by means of a directed acyclic graph. But this approach is not yet implemented in AS2005.

Existing algorithms such as association rules or decision trees might be used, but they are far from ideal for this task (association rules are designed for finding frequent sets of boolean items in data, like Name=Attribute; decision trees work well for classification tasks but by design perform poorly at revealing the direct and indirect influence of attributes).

It would be interesting to know what algorithms and approaches Microsoft plans to develop in the future.

We cannot comment on algorithms that will appear in future versions other than what has already been announced. In SQL Server 2008 we are introducing ARIMA time series with a default option to combine both ARTXP and ARIMA models to get the best of both approaches.

The July CTP of SQL Server 2008 has this algorithm available.
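
For readers who want to try the combined mode, here is a minimal DMX sketch. It assumes the SQL Server 2008 FORECAST_METHOD parameter of the Microsoft Time Series algorithm, which accepts ARTXP, ARIMA, or MIXED; the model and column names are hypothetical and do not come from this thread.

```sql
// Hypothetical model and column names, for illustration only.
// FORECAST_METHOD = 'MIXED' asks for the blend of ARTXP and ARIMA;
// 'ARTXP' or 'ARIMA' would select a single method instead.
CREATE MINING MODEL [SalesForecast_Mixed]
(
    [ReportingDate] DATE KEY TIME,
    [Amount] DOUBLE CONTINUOUS PREDICT
)
USING Microsoft_Time_Series (FORECAST_METHOD = 'MIXED')
```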


Wednesday, March 21, 2012

linear regression with nested explanation variable

We are trying to create a linear regression model with a nested table. We used the CREATE MINING MODEL syntax as follows:

    create mining model rate_plan3002_nested2
    ( CUST_cycle LONG KEY,
      VOICE_CHARGES double CONTINUOUS predict,
      DUR_PARTNER_GRP_1 double regressor CONTINUOUS,
      nested_taarif_time_3002 table
      ( CUST_cycle long CONTINUOUS,
        TARIFF_TIME text key,
        TARIFF_VOICE_DUR_ALL double regressor CONTINUOUS
      )
    ) using microsoft_linear_regression

    INSERT INTO MINING STRUCTURE [rate_plan3002_nested2_Structure]
    (CUST_cycle,
     VOICE_CHARGES,
     DUR_PARTNER_GRP_1,
     [nested_taarif_time_3002](SKIP, TARIFF_TIME, TARIFF_VOICE_DUR_ALL)
    )
    SHAPE {
      OPENQUERY([Cell],
        'SELECT CUST_cycle,
                VOICE_CHARGES,
                DUR_PARTNER_GRP_1
         FROM dbo.panel_anality_3002
         order by CUST_cycle ')}
    APPEND
    ({OPENQUERY([Cell],
        'select CUST_cycle,
                TARIFF_TIME,
                CYCLE_DATE
         from dbo.nested_taarif_time_3002
         order by CUST_cycle,TARIFF_TIME')
     }
     relate CUST_cycle to CUST_cycle
    ) as nested_taarif_time_3002

The result we got is a model with an intercept only. If we don't use the nested variable (the red line), we get a correct model. (We had more variables ...)

Is there a way to do this regression correctly?

Thanks,

Dror

Hi Dror,

You could remove the "regressor" flag from the nested table column (in the create mining model statement) if this column is not intended to be part of the regression equation.

Thanks,

Dana Cristofor
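
In DMX terms, the suggestion above amounts to something like the following sketch. It repeats the CREATE MINING MODEL statement from the question with the REGRESSOR flag dropped from the nested column only; the model name is hypothetical, and everything else is copied from the question as posted.

```sql
// Hypothetical model name; columns are those posted in the question.
CREATE MINING MODEL rate_plan3002_nested_noreg
(
    CUST_cycle LONG KEY,
    VOICE_CHARGES DOUBLE CONTINUOUS PREDICT,
    DUR_PARTNER_GRP_1 DOUBLE REGRESSOR CONTINUOUS,
    nested_taarif_time_3002 TABLE
    (
        CUST_cycle LONG CONTINUOUS,
        TARIFF_TIME TEXT KEY,
        // REGRESSOR flag removed here, so this column is not
        // flagged as a candidate for the regression formula.
        TARIFF_VOICE_DUR_ALL DOUBLE CONTINUOUS
    )
) USING Microsoft_Linear_Regression
```

This only helps when the nested column is not meant to be a regressor.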

|||

Thanks Dana,

the problem is that I do want it to be part of the regression,

otherwise I don't need the nested table.

Thanks

|||

I'm investigating an issue - let me work on this and get back to you.

Thanks

|||

I think I may have found the problem. I created a simple model, processed it, and got the expected regressions. I then created the same model using a nested table, processed it, and got a constant result back - very confusing. I then tried creating an additional model in the nested structure that would be identical to the first non-nested model I created, and again got no regressions - startling!

What I found out was that for some reason (possibly a bug) when I added a nested table, the wizard did not add the "regressor" flag to any of my continuous inputs. Once I manually added the regressor flag and reprocessed, I got the expected regressions in my output.

Please check the regressor flag on the model columns and let me know if this helps for you. To set the regressor flag, go to the Mining Models tab of the Data Mining designer, click the cell representing the input column under the mining model (not the mining structure) and view the properties. The regressor flag is a possible option for the mining model column.

Thanks

-Jamie
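
One way to check from DMX whether any regressor actually made it into the formula is to query the model content. The sketch below assumes the model name from the question and the documented content layout for linear regression models, in which NODE_DISTRIBUTION rows with VALUETYPE 7 hold coefficients and the row with VALUETYPE 11 holds the intercept.

```sql
// Lists the regression formula terms stored in the model content.
// VALUETYPE 7 = coefficient, VALUETYPE 11 = intercept.
SELECT FLATTENED MODEL_NAME,
    (SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, VALUETYPE
     FROM NODE_DISTRIBUTION) AS t
FROM [rate_plan3002_nested2].CONTENT
```

If no rows with VALUETYPE 7 come back, no regressor was used and the model is intercept-only.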

|||

Thanks Jamie,

We added the regressor flag in the Mining Models tab, and then we got the same results:

When we add the regressor flag to the column in the nested table, we get an intercept-only model.

If we put the regressor flag on the nested table only, and not on the column of the table, we get the same regression as if we didn't use the nested table.

Is there a way to solve it?

We would also like to run the model from the DMX query window of Management Studio. Is there a way to get the script of the model from Visual Studio?

Thanks,

Dror

|||

I wonder if it's possible that there's too much "noise" with the nested table? I've tried with a degenerate case to prove that there's no obvious bug in the software preventing nested regressors from working. Can you also try the same model with the decision tree algorithm and see what happens?

You can get the DMX form of the model by following the instructions in this tip: http://www.sqlserverdatamining.com/DMCommunity/TipsNTricks/3652.aspx
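
The decision tree comparison suggested above could be set up in DMX roughly as follows. This is a sketch that assumes the structure and column names from the question; the new model name is hypothetical. It adds a second model to the existing structure, so the same cases and nested rows are used for both algorithms.

```sql
// Hypothetical model name; structure and columns are from the question.
ALTER MINING STRUCTURE [rate_plan3002_nested2_Structure]
ADD MINING MODEL [rate_plan3002_nested2_DT]
(
    CUST_cycle,
    VOICE_CHARGES PREDICT,
    DUR_PARTNER_GRP_1 REGRESSOR,
    nested_taarif_time_3002
    (
        TARIFF_TIME,
        TARIFF_VOICE_DUR_ALL REGRESSOR
    )
)
USING Microsoft_Decision_Trees
```

The added model still has to be processed before its content can be browsed.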